{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PrimeKG Subgraph Construction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will showcase how to construct a subraph from PrimeKG and prepare necessary graph formats for further analysis.\n",
"\n",
"In particular, we will slice a subgraph from PrimeKG related to inflammatory bowel disease (IBD).\n",
"\n",
"The subgraph will contain all nodes and edges that are connected to IBD-related disease nodes, including the following relationships:\n",
"- Disease-Protein Relationship\n",
"- Disease-Disease Relationship (skipped as of now)\n",
"- Protein-Protein Relationship (skipped as of now)\n",
"- Drug-Protein Relationship\n",
"- Pathway-Protein Relationship\n",
"- Pathway-Pathway Relationship (skipped as of now)\n",
"- Bioprocess-Protein Relationship\n",
"- Molecular Function-Protein Relationship\n",
"- Cellular Component-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, to enrich the nodes and edges, we will perform the following tasks:\n",
"- Textual enrichment (only this task is implemented as of now) \n",
"- Multi-modal enrichment (to be added)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"First of all, we need to import necessary libraries as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"import networkx as nx\n",
"import pickle\n",
"from tqdm import tqdm\n",
"from torch_geometric.utils import from_networkx\n",
"import sys\n",
"sys.path.append('../../..')\n",
"from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG\n",
"from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG\n",
"from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama\n",
"from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils\n",
"\n",
"# # Set the logging level for httpx to WARNING to suppress INFO messages\n",
"import logging\n",
"logging.getLogger(\"httpx\").setLevel(logging.WARNING)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PrimeKG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We utilize the `PrimeKG` class from the aiagents4pharma/talk2knowledgegraphs library.\n",
"\n",
"The `PrimeKG` needs to be initialized with the path to the PrimeKG dataset to be stored/loaded from the local directory."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading nodes of PrimeKG dataset ...\n",
"../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.\n",
"Loading edges of PrimeKG dataset ...\n",
"../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.\n"
]
}
],
"source": [
"# Define primekg data by providing a local directory where the data is stored\n",
"primekg_data = PrimeKG(local_dir=\"../../../../data/primekg/\")\n",
"\n",
"# Invoke a method to load the data\n",
"primekg_data.load_data()\n",
"\n",
"# Get primekg_nodes and primekg_edges\n",
"primekg_nodes = primekg_data.get_nodes()\n",
"primekg_edges = primekg_data.get_edges()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### IBD-related Data Filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### IBD-related Disease Nodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first step, we will perform data filtering over the primekg_nodes by querying the nodes that contains the following terms:\n",
"- inflammatory bowel disease\n",
"- crohn\n",
"- ulcerative colitis\n",
"\n",
"As of now, this basic query is used to filter the data. However, this can be replaced with a more complex query that can capture more nodes related to IBD."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_index</th>\n",
" <th>node_name</th>\n",
" <th>node_source</th>\n",
" <th>node_id</th>\n",
" <th>node_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>27269</th>\n",
" <td>27269</td>\n",
" <td>IL21-related infantile inflammatory bowel disease</td>\n",
" <td>MONDO</td>\n",
" <td>14338</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28158</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29293</th>\n",
" <td>29293</td>\n",
" <td>inflammatory bowel disease, immunodeficiency, ...</td>\n",
" <td>MONDO</td>\n",
" <td>32601</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35814</th>\n",
" <td>35814</td>\n",
" <td>Crohn ileitis and jejunitis</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>709_21207</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35815</th>\n",
" <td>35815</td>\n",
" <td>small bowel Crohn disease</td>\n",
" <td>MONDO</td>\n",
" <td>5539</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37784</th>\n",
" <td>37784</td>\n",
" <td>Crohn disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>5011_5535</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37785</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39013</th>\n",
" <td>39013</td>\n",
" <td>immune dysregulation-inflammatory bowel diseas...</td>\n",
" <td>MONDO</td>\n",
" <td>16542</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39787</th>\n",
" <td>39787</td>\n",
" <td>immune dysregulation with inflammatory bowel d...</td>\n",
" <td>MONDO</td>\n",
" <td>33967</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83770</th>\n",
" <td>83770</td>\n",
" <td>Crohn's colitis</td>\n",
" <td>MONDO</td>\n",
" <td>5532</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95279</th>\n",
" <td>95279</td>\n",
" <td>Crohn jejunoileitis</td>\n",
" <td>MONDO</td>\n",
" <td>708</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95280</th>\n",
" <td>95280</td>\n",
" <td>gastroduodenal Crohn disease</td>\n",
" <td>MONDO</td>\n",
" <td>710</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97088</th>\n",
" <td>97088</td>\n",
" <td>perianal Crohn disease</td>\n",
" <td>MONDO</td>\n",
" <td>5537</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99325</th>\n",
" <td>99325</td>\n",
" <td>Crohn disease of the esophagus</td>\n",
" <td>MONDO</td>\n",
" <td>22901</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99680</th>\n",
" <td>99680</td>\n",
" <td>immune dysregulation-inflammatory bowel diseas...</td>\n",
" <td>MONDO</td>\n",
" <td>33968</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99681</th>\n",
" <td>99681</td>\n",
" <td>inflammatory bowel disease-recurrent sinopulmo...</td>\n",
" <td>MONDO</td>\n",
" <td>33969</td>\n",
" <td>disease</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node_index node_name \\\n",
"27269 27269 IL21-related infantile inflammatory bowel disease \n",
"28158 28158 inflammatory bowel disease \n",
"29293 29293 inflammatory bowel disease, immunodeficiency, ... \n",
"35814 35814 Crohn ileitis and jejunitis \n",
"35815 35815 small bowel Crohn disease \n",
"37784 37784 Crohn disease \n",
"37785 37785 ulcerative colitis (disease) \n",
"39013 39013 immune dysregulation-inflammatory bowel diseas... \n",
"39787 39787 immune dysregulation with inflammatory bowel d... \n",
"83770 83770 Crohn's colitis \n",
"95279 95279 Crohn jejunoileitis \n",
"95280 95280 gastroduodenal Crohn disease \n",
"97088 97088 perianal Crohn disease \n",
"99325 99325 Crohn disease of the esophagus \n",
"99680 99680 immune dysregulation-inflammatory bowel diseas... \n",
"99681 99681 inflammatory bowel disease-recurrent sinopulmo... \n",
"\n",
" node_source node_id \\\n",
"27269 MONDO 14338 \n",
"28158 MONDO_grouped 9960_12845_33643_11471_12831_12875_12941_13153... \n",
"29293 MONDO 32601 \n",
"35814 MONDO_grouped 709_21207 \n",
"35815 MONDO 5539 \n",
"37784 MONDO_grouped 5011_5535 \n",
"37785 MONDO 5101 \n",
"39013 MONDO 16542 \n",
"39787 MONDO 33967 \n",
"83770 MONDO 5532 \n",
"95279 MONDO 708 \n",
"95280 MONDO 710 \n",
"97088 MONDO 5537 \n",
"99325 MONDO 22901 \n",
"99680 MONDO 33968 \n",
"99681 MONDO 33969 \n",
"\n",
" node_type \n",
"27269 disease \n",
"28158 disease \n",
"29293 disease \n",
"35814 disease \n",
"35815 disease \n",
"37784 disease \n",
"37785 disease \n",
"39013 disease \n",
"39787 disease \n",
"83770 disease \n",
"95279 disease \n",
"95280 disease \n",
"97088 disease \n",
"99325 disease \n",
"99680 disease \n",
"99681 disease "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Query for nodes related to IBD\n",
"query_str = 'node_name_lower.str.contains(\"inflammatory bowel disease\")'\n",
"query_str += 'or node_name_lower.str.contains(\"crohn\")'\n",
"query_str += 'or node_name_lower.str.contains(\"ulcerative colitis\")'\n",
"\n",
"# Get the nodes related to IBD\n",
"ibd_nodes_df = primekg_nodes.copy()\n",
"ibd_nodes_df[\"node_name_lower\"] = primekg_nodes.node_name.apply(lambda x: x.lower())\n",
"ibd_nodes_df = ibd_nodes_df[ibd_nodes_df.node_type == \"disease\"].query(query_str, engine='python')\n",
"ibd_nodes_df.drop(columns=[\"node_name_lower\"], inplace=True)\n",
"ibd_nodes_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Disease-Protein Relationship\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the nodes related to IBD, we can further capture the records containing the relationships of disease-gene/protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5988787</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5988788</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5988789</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5988790</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5988791</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2712</td>\n",
" <td>CASP3</td>\n",
" <td>NCBI</td>\n",
" <td>836</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3304471</th>\n",
" <td>34780</td>\n",
" <td>IRGM</td>\n",
" <td>NCBI</td>\n",
" <td>345611</td>\n",
" <td>gene/protein</td>\n",
" <td>35814</td>\n",
" <td>Crohn ileitis and jejunitis</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>709_21207</td>\n",
" <td>disease</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3310277</th>\n",
" <td>5022</td>\n",
" <td>ITGAM</td>\n",
" <td>NCBI</td>\n",
" <td>3684</td>\n",
" <td>gene/protein</td>\n",
" <td>35814</td>\n",
" <td>Crohn ileitis and jejunitis</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>709_21207</td>\n",
" <td>disease</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3313160</th>\n",
" <td>2889</td>\n",
" <td>TGFB1</td>\n",
" <td>NCBI</td>\n",
" <td>7040</td>\n",
" <td>gene/protein</td>\n",
" <td>29293</td>\n",
" <td>inflammatory bowel disease, immunodeficiency, ...</td>\n",
" <td>MONDO</td>\n",
" <td>32601</td>\n",
" <td>disease</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3314800</th>\n",
" <td>9104</td>\n",
" <td>INAVA</td>\n",
" <td>NCBI</td>\n",
" <td>55765</td>\n",
" <td>gene/protein</td>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3314949</th>\n",
" <td>34967</td>\n",
" <td>IL21</td>\n",
" <td>NCBI</td>\n",
" <td>59067</td>\n",
" <td>gene/protein</td>\n",
" <td>27269</td>\n",
" <td>IL21-related infantile inflammatory bowel disease</td>\n",
" <td>MONDO</td>\n",
" <td>14338</td>\n",
" <td>disease</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>620 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source \\\n",
"5988787 37785 ulcerative colitis (disease) MONDO \n",
"5988788 28158 inflammatory bowel disease MONDO_grouped \n",
"5988789 37785 ulcerative colitis (disease) MONDO \n",
"5988790 28158 inflammatory bowel disease MONDO_grouped \n",
"5988791 37785 ulcerative colitis (disease) MONDO \n",
"... ... ... ... \n",
"3304471 34780 IRGM NCBI \n",
"3310277 5022 ITGAM NCBI \n",
"3313160 2889 TGFB1 NCBI \n",
"3314800 9104 INAVA NCBI \n",
"3314949 34967 IL21 NCBI \n",
"\n",
" head_id head_type \\\n",
"5988787 5101 disease \n",
"5988788 9960_12845_33643_11471_12831_12875_12941_13153... disease \n",
"5988789 5101 disease \n",
"5988790 9960_12845_33643_11471_12831_12875_12941_13153... disease \n",
"5988791 5101 disease \n",
"... ... ... \n",
"3304471 345611 gene/protein \n",
"3310277 3684 gene/protein \n",
"3313160 7040 gene/protein \n",
"3314800 55765 gene/protein \n",
"3314949 59067 gene/protein \n",
"\n",
" tail_index tail_name \\\n",
"5988787 7359 ADCY7 \n",
"5988788 7359 ADCY7 \n",
"5988789 2874 PRDM1 \n",
"5988790 2874 PRDM1 \n",
"5988791 2712 CASP3 \n",
"... ... ... \n",
"3304471 35814 Crohn ileitis and jejunitis \n",
"3310277 35814 Crohn ileitis and jejunitis \n",
"3313160 29293 inflammatory bowel disease, immunodeficiency, ... \n",
"3314800 28158 inflammatory bowel disease \n",
"3314949 27269 IL21-related infantile inflammatory bowel disease \n",
"\n",
" tail_source tail_id \\\n",
"5988787 NCBI 113 \n",
"5988788 NCBI 113 \n",
"5988789 NCBI 639 \n",
"5988790 NCBI 639 \n",
"5988791 NCBI 836 \n",
"... ... ... \n",
"3304471 MONDO_grouped 709_21207 \n",
"3310277 MONDO_grouped 709_21207 \n",
"3313160 MONDO 32601 \n",
"3314800 MONDO_grouped 9960_12845_33643_11471_12831_12875_12941_13153... \n",
"3314949 MONDO 14338 \n",
"\n",
" tail_type display_relation relation \n",
"5988787 gene/protein associated with disease_protein \n",
"5988788 gene/protein associated with disease_protein \n",
"5988789 gene/protein associated with disease_protein \n",
"5988790 gene/protein associated with disease_protein \n",
"5988791 gene/protein associated with disease_protein \n",
"... ... ... ... \n",
"3304471 disease associated with disease_protein \n",
"3310277 disease associated with disease_protein \n",
"3313160 disease associated with disease_protein \n",
"3314800 disease associated with disease_protein \n",
"3314949 disease associated with disease_protein \n",
"\n",
"[620 rows x 12 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD disease_protein edges\n",
"ibd_disease_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & \n",
" (primekg_edges.tail_type == 'gene/protein')],\n",
" primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & \n",
" (primekg_edges.head_type == 'gene/protein')]])\n",
"\n",
"# Check dataframe\n",
"ibd_disease_protein_edges_df"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 144, 179, 192, 279, 417, 625, 657, 729, 772,\n",
" 989, 1004, 1122, 1299, 1480, 1567, 1618, 1654, 1777,\n",
" 1990, 2012, 2057, 2078, 2111, 2139, 2329, 2384, 2543,\n",
" 2643, 2712, 2749, 2874, 2889, 2978, 2983, 3064, 3088,\n",
" 3233, 3259, 3333, 3414, 3460, 3469, 3474, 3484, 3495,\n",
" 3578, 3646, 4152, 4162, 4731, 4818, 4968, 4997, 5022,\n",
" 5195, 5385, 5720, 5805, 5915, 6168, 6175, 6229, 6428,\n",
" 6661, 7059, 7083, 7359, 7384, 7899, 7958, 8030, 8564,\n",
" 9104, 9454, 9763, 10113, 10191, 10919, 11103, 11134, 11199,\n",
" 11523, 11588, 12305, 12663, 12740, 12763, 12816, 13014, 13365,\n",
" 21972, 22105, 34623, 34776, 34777, 34778, 34779, 34780, 34781,\n",
" 34814, 34887, 34967, 35156])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get unique protein index\n",
"ibd_protein_index = np.unique(np.concatenate([ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.head_type == 'gene/protein'].head_index.unique(),\n",
" ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.tail_type == 'gene/protein'].tail_index.unique()]))\n",
"ibd_protein_index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Disease-Disease Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we can get the records containing the relationships of disease-disease nodes."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# # IBD disease_disease edges \n",
"# ibd_disease_disease_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & \n",
"# (primekg_edges.tail_type == 'disease')],\n",
"# primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & \n",
"# (primekg_edges.head_type == 'disease')]])\n",
"\n",
"# # Check dataframe\n",
"# ibd_disease_disease_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Protein-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also can get the records containing the relationships of gene/protein-gene/protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# # IBD protein_protein edges \n",
"# ibd_protein_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_protein_index)) & \n",
"# (primekg_edges.tail_type == 'gene/protein')],\n",
"# primekg_edges[(primekg_edges.tail_index.isin(ibd_protein_index)) & \n",
"# (primekg_edges.head_type == 'gene/protein')]])\n",
"\n",
"# # Check dataframe\n",
"# ibd_protein_protein_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Drug-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will get the records containing the relationships of drug-gene/protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>321759</th>\n",
" <td>14118</td>\n",
" <td>Rose bengal</td>\n",
" <td>DrugBank</td>\n",
" <td>DB11182</td>\n",
" <td>drug</td>\n",
" <td>3233</td>\n",
" <td>LTF</td>\n",
" <td>NCBI</td>\n",
" <td>4057</td>\n",
" <td>gene/protein</td>\n",
" <td>carrier</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321763</th>\n",
" <td>14038</td>\n",
" <td>Fluticasone furoate</td>\n",
" <td>DrugBank</td>\n",
" <td>DB08906</td>\n",
" <td>drug</td>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>carrier</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321764</th>\n",
" <td>14555</td>\n",
" <td>Technetium Tc-99m tetrofosmin</td>\n",
" <td>DrugBank</td>\n",
" <td>DB09160</td>\n",
" <td>drug</td>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>carrier</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321765</th>\n",
" <td>14040</td>\n",
" <td>Fluticasone</td>\n",
" <td>DrugBank</td>\n",
" <td>DB13867</td>\n",
" <td>drug</td>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>carrier</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>322373</th>\n",
" <td>14060</td>\n",
" <td>Levothyroxine</td>\n",
" <td>DrugBank</td>\n",
" <td>DB00451</td>\n",
" <td>drug</td>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>enzyme</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5731639</th>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>14498</td>\n",
" <td>Risdiplam</td>\n",
" <td>DrugBank</td>\n",
" <td>DB15305</td>\n",
" <td>drug</td>\n",
" <td>transporter</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5731640</th>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>14908</td>\n",
" <td>Ubrogepant</td>\n",
" <td>DrugBank</td>\n",
" <td>DB15328</td>\n",
" <td>drug</td>\n",
" <td>transporter</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5731641</th>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>14499</td>\n",
" <td>Elexacaftor</td>\n",
" <td>DrugBank</td>\n",
" <td>DB15444</td>\n",
" <td>drug</td>\n",
" <td>transporter</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5731642</th>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>14050</td>\n",
" <td>Prednisolone acetate</td>\n",
" <td>DrugBank</td>\n",
" <td>DB15566</td>\n",
" <td>drug</td>\n",
" <td>transporter</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5731643</th>\n",
" <td>4152</td>\n",
" <td>ABCB1</td>\n",
" <td>NCBI</td>\n",
" <td>5243</td>\n",
" <td>gene/protein</td>\n",
" <td>15752</td>\n",
" <td>Selpercatinib</td>\n",
" <td>DrugBank</td>\n",
" <td>DB15685</td>\n",
" <td>drug</td>\n",
" <td>transporter</td>\n",
" <td>drug_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2030 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source head_id \\\n",
"321759 14118 Rose bengal DrugBank DB11182 \n",
"321763 14038 Fluticasone furoate DrugBank DB08906 \n",
"321764 14555 Technetium Tc-99m tetrofosmin DrugBank DB09160 \n",
"321765 14040 Fluticasone DrugBank DB13867 \n",
"322373 14060 Levothyroxine DrugBank DB00451 \n",
"... ... ... ... ... \n",
"5731639 4152 ABCB1 NCBI 5243 \n",
"5731640 4152 ABCB1 NCBI 5243 \n",
"5731641 4152 ABCB1 NCBI 5243 \n",
"5731642 4152 ABCB1 NCBI 5243 \n",
"5731643 4152 ABCB1 NCBI 5243 \n",
"\n",
" head_type tail_index tail_name tail_source tail_id \\\n",
"321759 drug 3233 LTF NCBI 4057 \n",
"321763 drug 4152 ABCB1 NCBI 5243 \n",
"321764 drug 4152 ABCB1 NCBI 5243 \n",
"321765 drug 4152 ABCB1 NCBI 5243 \n",
"322373 drug 4152 ABCB1 NCBI 5243 \n",
"... ... ... ... ... ... \n",
"5731639 gene/protein 14498 Risdiplam DrugBank DB15305 \n",
"5731640 gene/protein 14908 Ubrogepant DrugBank DB15328 \n",
"5731641 gene/protein 14499 Elexacaftor DrugBank DB15444 \n",
"5731642 gene/protein 14050 Prednisolone acetate DrugBank DB15566 \n",
"5731643 gene/protein 15752 Selpercatinib DrugBank DB15685 \n",
"\n",
" tail_type display_relation relation \n",
"321759 gene/protein carrier drug_protein \n",
"321763 gene/protein carrier drug_protein \n",
"321764 gene/protein carrier drug_protein \n",
"321765 gene/protein carrier drug_protein \n",
"322373 gene/protein enzyme drug_protein \n",
"... ... ... ... \n",
"5731639 drug transporter drug_protein \n",
"5731640 drug transporter drug_protein \n",
"5731641 drug transporter drug_protein \n",
"5731642 drug transporter drug_protein \n",
"5731643 drug transporter drug_protein \n",
"\n",
"[2030 rows x 12 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD drug_protein edges\n",
"ibd_drug_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'drug') & \n",
" (primekg_edges.tail_type == 'gene/protein') & \n",
" (primekg_edges.tail_index.isin(ibd_protein_index))], \n",
" primekg_edges[(primekg_edges.tail_type == 'drug') & \n",
" (primekg_edges.head_type == 'gene/protein') & \n",
" (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
"\n",
"# Check dataframe\n",
"ibd_drug_protein_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Pathway-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this case, we will get the records containing the relationships of pathway-protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6505784</th>\n",
" <td>62703</td>\n",
" <td>Adherens junctions interactions</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-418990</td>\n",
" <td>pathway</td>\n",
" <td>8030</td>\n",
" <td>CDH3</td>\n",
" <td>NCBI</td>\n",
" <td>1001</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6506102</th>\n",
" <td>128079</td>\n",
" <td>Regulation of actin dynamics for phagocytic cu...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-2029482</td>\n",
" <td>pathway</td>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6506103</th>\n",
" <td>128183</td>\n",
" <td>EPHB-mediated forward signaling</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-3928662</td>\n",
" <td>pathway</td>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6506104</th>\n",
" <td>128022</td>\n",
" <td>RHO GTPases Activate WASPs and WAVEs</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-5663213</td>\n",
" <td>pathway</td>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6506105</th>\n",
" <td>62931</td>\n",
" <td>Clathrin-mediated endocytosis</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-8856828</td>\n",
" <td>pathway</td>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3834665</th>\n",
" <td>2543</td>\n",
" <td>CDH1</td>\n",
" <td>NCBI</td>\n",
" <td>999</td>\n",
" <td>gene/protein</td>\n",
" <td>127731</td>\n",
" <td>Integrin cell surface interactions</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-216083</td>\n",
" <td>pathway</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3834666</th>\n",
" <td>2543</td>\n",
" <td>CDH1</td>\n",
" <td>NCBI</td>\n",
" <td>999</td>\n",
" <td>gene/protein</td>\n",
" <td>127617</td>\n",
" <td>Apoptotic cleavage of cell adhesion proteins</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-351906</td>\n",
" <td>pathway</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3834667</th>\n",
" <td>2543</td>\n",
" <td>CDH1</td>\n",
" <td>NCBI</td>\n",
" <td>999</td>\n",
" <td>gene/protein</td>\n",
" <td>62703</td>\n",
" <td>Adherens junctions interactions</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-418990</td>\n",
" <td>pathway</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3834668</th>\n",
" <td>2543</td>\n",
" <td>CDH1</td>\n",
" <td>NCBI</td>\n",
" <td>999</td>\n",
" <td>gene/protein</td>\n",
" <td>128018</td>\n",
" <td>RHO GTPases activate IQGAPs</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-5626467</td>\n",
" <td>pathway</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3834669</th>\n",
" <td>2543</td>\n",
" <td>CDH1</td>\n",
" <td>NCBI</td>\n",
" <td>999</td>\n",
" <td>gene/protein</td>\n",
" <td>129039</td>\n",
" <td>InlA-mediated entry of Listeria monocytogenes ...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-8876493</td>\n",
" <td>pathway</td>\n",
" <td>interacts with</td>\n",
" <td>pathway_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1030 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name \\\n",
"6505784 62703 Adherens junctions interactions \n",
"6506102 128079 Regulation of actin dynamics for phagocytic cu... \n",
"6506103 128183 EPHB-mediated forward signaling \n",
"6506104 128022 RHO GTPases Activate WASPs and WAVEs \n",
"6506105 62931 Clathrin-mediated endocytosis \n",
"... ... ... \n",
"3834665 2543 CDH1 \n",
"3834666 2543 CDH1 \n",
"3834667 2543 CDH1 \n",
"3834668 2543 CDH1 \n",
"3834669 2543 CDH1 \n",
"\n",
" head_source head_id head_type tail_index \\\n",
"6505784 REACTOME R-HSA-418990 pathway 8030 \n",
"6506102 REACTOME R-HSA-2029482 pathway 2139 \n",
"6506103 REACTOME R-HSA-3928662 pathway 2139 \n",
"6506104 REACTOME R-HSA-5663213 pathway 2139 \n",
"6506105 REACTOME R-HSA-8856828 pathway 2139 \n",
"... ... ... ... ... \n",
"3834665 NCBI 999 gene/protein 127731 \n",
"3834666 NCBI 999 gene/protein 127617 \n",
"3834667 NCBI 999 gene/protein 62703 \n",
"3834668 NCBI 999 gene/protein 128018 \n",
"3834669 NCBI 999 gene/protein 129039 \n",
"\n",
" tail_name tail_source \\\n",
"6505784 CDH3 NCBI \n",
"6506102 ARPC2 NCBI \n",
"6506103 ARPC2 NCBI \n",
"6506104 ARPC2 NCBI \n",
"6506105 ARPC2 NCBI \n",
"... ... ... \n",
"3834665 Integrin cell surface interactions REACTOME \n",
"3834666 Apoptotic cleavage of cell adhesion proteins REACTOME \n",
"3834667 Adherens junctions interactions REACTOME \n",
"3834668 RHO GTPases activate IQGAPs REACTOME \n",
"3834669 InlA-mediated entry of Listeria monocytogenes ... REACTOME \n",
"\n",
" tail_id tail_type display_relation relation \n",
"6505784 1001 gene/protein interacts with pathway_protein \n",
"6506102 10109 gene/protein interacts with pathway_protein \n",
"6506103 10109 gene/protein interacts with pathway_protein \n",
"6506104 10109 gene/protein interacts with pathway_protein \n",
"6506105 10109 gene/protein interacts with pathway_protein \n",
"... ... ... ... ... \n",
"3834665 R-HSA-216083 pathway interacts with pathway_protein \n",
"3834666 R-HSA-351906 pathway interacts with pathway_protein \n",
"3834667 R-HSA-418990 pathway interacts with pathway_protein \n",
"3834668 R-HSA-5626467 pathway interacts with pathway_protein \n",
"3834669 R-HSA-8876493 pathway interacts with pathway_protein \n",
"\n",
"[1030 rows x 12 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD pathway_protein edges \n",
"ibd_pathway_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'pathway') & \n",
" (primekg_edges.tail_type == 'gene/protein') & \n",
" (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
" primekg_edges[(primekg_edges.tail_type == 'pathway') & \n",
" (primekg_edges.head_type == 'gene/protein') & \n",
" (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
"\n",
"# Check dataframe\n",
"ibd_pathway_protein_edges_df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 62341, 62347, 62348, 62373, 62376, 62394, 62400, 62401,\n",
" 62404, 62405, 62414, 62448, 62449, 62462, 62465, 62467,\n",
" 62469, 62472, 62476, 62477, 62483, 62543, 62571, 62573,\n",
" 62575, 62583, 62588, 62596, 62603, 62606, 62628, 62644,\n",
" 62651, 62655, 62657, 62672, 62675, 62691, 62692, 62697,\n",
" 62702, 62703, 62711, 62717, 62733, 62734, 62768, 62770,\n",
" 62805, 62807, 62836, 62865, 62916, 62925, 62931, 62968,\n",
" 62976, 62987, 62996, 63041, 63064, 63071, 63076, 127601,\n",
" 127615, 127616, 127617, 127619, 127620, 127624, 127628, 127629,\n",
" 127639, 127640, 127649, 127659, 127662, 127682, 127683, 127688,\n",
" 127691, 127693, 127694, 127695, 127696, 127726, 127727, 127728,\n",
" 127729, 127730, 127731, 127732, 127733, 127791, 127797, 127810,\n",
" 127814, 127815, 127833, 127835, 127856, 127858, 127866, 127867,\n",
" 127869, 127886, 127891, 127908, 127917, 127918, 127921, 127928,\n",
" 127958, 127960, 127971, 127977, 127999, 128001, 128002, 128003,\n",
" 128008, 128010, 128015, 128018, 128022, 128025, 128034, 128058,\n",
" 128065, 128071, 128072, 128073, 128074, 128078, 128079, 128080,\n",
" 128086, 128111, 128113, 128116, 128117, 128129, 128137, 128138,\n",
" 128139, 128158, 128165, 128170, 128176, 128183, 128186, 128191,\n",
" 128198, 128199, 128204, 128208, 128209, 128224, 128227, 128242,\n",
" 128243, 128244, 128253, 128254, 128270, 128271, 128272, 128273,\n",
" 128299, 128302, 128341, 128348, 128349, 128350, 128351, 128353,\n",
" 128360, 128378, 128381, 128393, 128395, 128396, 128399, 128430,\n",
" 128440, 128453, 128460, 128470, 128472, 128473, 128477, 128478,\n",
" 128479, 128480, 128481, 128482, 128483, 128484, 128486, 128487,\n",
" 128497, 128498, 128499, 128500, 128501, 128503, 128527, 128535,\n",
" 128550, 128593, 128599, 128601, 128602, 128604, 128655, 128677,\n",
" 128715, 128759, 128766, 128767, 128779, 128781, 128782, 128783,\n",
" 128784, 128789, 128792, 128801, 128804, 128814, 128815, 128827,\n",
" 128828, 128829, 128830, 128832, 128835, 128837, 128838, 128841,\n",
" 128846, 128851, 128852, 128878, 128976, 128977, 128978, 128979,\n",
" 128980, 128981, 128988, 128990, 129007, 129015, 129016, 129021,\n",
" 129023, 129035, 129039, 129040, 129042, 129044, 129047, 129048,\n",
" 129052, 129099, 129110, 129124, 129125, 129126, 129127, 129128,\n",
" 129131, 129135, 129136, 129139, 129140, 129141, 129148, 129155,\n",
" 129167, 129181, 129183, 129190, 129195, 129196, 129197, 129198,\n",
" 129215, 129217, 129238, 129257, 129258, 129259, 129264, 129266,\n",
" 129289, 129294, 129296, 129302, 129303, 129310, 129355, 129360,\n",
" 129361, 129365, 129366, 129367])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get unique protein index\n",
"ibd_pathway_index = np.unique(np.concatenate([ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.head_type == 'pathway'].head_index.unique(),\n",
" ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.tail_type == 'pathway'].tail_index.unique()]))\n",
"ibd_pathway_index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Pathway-Pathway Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As well as, a set of records containing the relationships of pathway-pathway nodes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# # # IBD pathway_pathway edges \n",
"# ibd_pathway_pathway_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_pathway_index)) & \n",
"# (primekg_edges.tail_type == 'pathway')],\n",
"# primekg_edges[(primekg_edges.tail_index.isin(ibd_pathway_index)) & \n",
"# (primekg_edges.head_type == 'pathway')]])\n",
"\n",
"# # Check dataframe\n",
"# ibd_pathway_pathway_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bioprocess-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next step is to get the records containing the relationships of biological_process-gene/protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6351294</th>\n",
" <td>112487</td>\n",
" <td>neutrophil degranulation</td>\n",
" <td>GO</td>\n",
" <td>43312</td>\n",
" <td>biological_process</td>\n",
" <td>1990</td>\n",
" <td>FCGR2A</td>\n",
" <td>NCBI</td>\n",
" <td>2212</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6351300</th>\n",
" <td>112487</td>\n",
" <td>neutrophil degranulation</td>\n",
" <td>GO</td>\n",
" <td>43312</td>\n",
" <td>biological_process</td>\n",
" <td>3333</td>\n",
" <td>FPR2</td>\n",
" <td>NCBI</td>\n",
" <td>2358</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6351340</th>\n",
" <td>112487</td>\n",
" <td>neutrophil degranulation</td>\n",
" <td>GO</td>\n",
" <td>43312</td>\n",
" <td>biological_process</td>\n",
" <td>2012</td>\n",
" <td>CXCR1</td>\n",
" <td>NCBI</td>\n",
" <td>3577</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6351341</th>\n",
" <td>112487</td>\n",
" <td>neutrophil degranulation</td>\n",
" <td>GO</td>\n",
" <td>43312</td>\n",
" <td>biological_process</td>\n",
" <td>3064</td>\n",
" <td>CXCR2</td>\n",
" <td>NCBI</td>\n",
" <td>3579</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6351346</th>\n",
" <td>112487</td>\n",
" <td>neutrophil degranulation</td>\n",
" <td>GO</td>\n",
" <td>43312</td>\n",
" <td>biological_process</td>\n",
" <td>5022</td>\n",
" <td>ITGAM</td>\n",
" <td>NCBI</td>\n",
" <td>3684</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3781707</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>51599</td>\n",
" <td>negative regulation of peroxidase activity</td>\n",
" <td>GO</td>\n",
" <td>2000469</td>\n",
" <td>biological_process</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3781708</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>52358</td>\n",
" <td>regulation of kidney size</td>\n",
" <td>GO</td>\n",
" <td>35564</td>\n",
" <td>biological_process</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3781710</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>109343</td>\n",
" <td>negative regulation of thioredoxin peroxidase ...</td>\n",
" <td>GO</td>\n",
" <td>1903125</td>\n",
" <td>biological_process</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3781811</th>\n",
" <td>22105</td>\n",
" <td>GPBAR1</td>\n",
" <td>NCBI</td>\n",
" <td>151306</td>\n",
" <td>gene/protein</td>\n",
" <td>105254</td>\n",
" <td>cell surface bile acid receptor signaling pathway</td>\n",
" <td>GO</td>\n",
" <td>38184</td>\n",
" <td>biological_process</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3781824</th>\n",
" <td>34779</td>\n",
" <td>NKX2-3</td>\n",
" <td>NCBI</td>\n",
" <td>159296</td>\n",
" <td>gene/protein</td>\n",
" <td>100699</td>\n",
" <td>post-embryonic digestive tract morphogenesis</td>\n",
" <td>GO</td>\n",
" <td>48621</td>\n",
" <td>biological_process</td>\n",
" <td>interacts with</td>\n",
" <td>bioprocess_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6300 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source head_id \\\n",
"6351294 112487 neutrophil degranulation GO 43312 \n",
"6351300 112487 neutrophil degranulation GO 43312 \n",
"6351340 112487 neutrophil degranulation GO 43312 \n",
"6351341 112487 neutrophil degranulation GO 43312 \n",
"6351346 112487 neutrophil degranulation GO 43312 \n",
"... ... ... ... ... \n",
"3781707 2111 LRRK2 NCBI 120892 \n",
"3781708 2111 LRRK2 NCBI 120892 \n",
"3781710 2111 LRRK2 NCBI 120892 \n",
"3781811 22105 GPBAR1 NCBI 151306 \n",
"3781824 34779 NKX2-3 NCBI 159296 \n",
"\n",
" head_type tail_index \\\n",
"6351294 biological_process 1990 \n",
"6351300 biological_process 3333 \n",
"6351340 biological_process 2012 \n",
"6351341 biological_process 3064 \n",
"6351346 biological_process 5022 \n",
"... ... ... \n",
"3781707 gene/protein 51599 \n",
"3781708 gene/protein 52358 \n",
"3781710 gene/protein 109343 \n",
"3781811 gene/protein 105254 \n",
"3781824 gene/protein 100699 \n",
"\n",
" tail_name tail_source \\\n",
"6351294 FCGR2A NCBI \n",
"6351300 FPR2 NCBI \n",
"6351340 CXCR1 NCBI \n",
"6351341 CXCR2 NCBI \n",
"6351346 ITGAM NCBI \n",
"... ... ... \n",
"3781707 negative regulation of peroxidase activity GO \n",
"3781708 regulation of kidney size GO \n",
"3781710 negative regulation of thioredoxin peroxidase ... GO \n",
"3781811 cell surface bile acid receptor signaling pathway GO \n",
"3781824 post-embryonic digestive tract morphogenesis GO \n",
"\n",
" tail_id tail_type display_relation relation \n",
"6351294 2212 gene/protein interacts with bioprocess_protein \n",
"6351300 2358 gene/protein interacts with bioprocess_protein \n",
"6351340 3577 gene/protein interacts with bioprocess_protein \n",
"6351341 3579 gene/protein interacts with bioprocess_protein \n",
"6351346 3684 gene/protein interacts with bioprocess_protein \n",
"... ... ... ... ... \n",
"3781707 2000469 biological_process interacts with bioprocess_protein \n",
"3781708 35564 biological_process interacts with bioprocess_protein \n",
"3781710 1903125 biological_process interacts with bioprocess_protein \n",
"3781811 38184 biological_process interacts with bioprocess_protein \n",
"3781824 48621 biological_process interacts with bioprocess_protein \n",
"\n",
"[6300 rows x 12 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD bioprocess_protein edges \n",
"ibd_bioprocess_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'biological_process') & \n",
" (primekg_edges.tail_type == 'gene/protein') & \n",
" (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
" primekg_edges[(primekg_edges.tail_type == 'biological_process') & \n",
" (primekg_edges.head_type == 'gene/protein') & \n",
" (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
"\n",
"# Check dataframe\n",
"ibd_bioprocess_protein_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### MolFunc-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we would like to get biological_process-gene/protein relationships."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6198264</th>\n",
" <td>54035</td>\n",
" <td>interleukin-1 binding</td>\n",
" <td>GO</td>\n",
" <td>19966</td>\n",
" <td>molecular_function</td>\n",
" <td>1654</td>\n",
" <td>IL1R2</td>\n",
" <td>NCBI</td>\n",
" <td>7850</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6198359</th>\n",
" <td>54290</td>\n",
" <td>enzyme binding</td>\n",
" <td>GO</td>\n",
" <td>19899</td>\n",
" <td>molecular_function</td>\n",
" <td>3578</td>\n",
" <td>ECM1</td>\n",
" <td>NCBI</td>\n",
" <td>1893</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6198366</th>\n",
" <td>54290</td>\n",
" <td>enzyme binding</td>\n",
" <td>GO</td>\n",
" <td>19899</td>\n",
" <td>molecular_function</td>\n",
" <td>2057</td>\n",
" <td>FN1</td>\n",
" <td>NCBI</td>\n",
" <td>2335</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6198442</th>\n",
" <td>54290</td>\n",
" <td>enzyme binding</td>\n",
" <td>GO</td>\n",
" <td>19899</td>\n",
" <td>molecular_function</td>\n",
" <td>989</td>\n",
" <td>PPARG</td>\n",
" <td>NCBI</td>\n",
" <td>5468</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6198462</th>\n",
" <td>54290</td>\n",
" <td>enzyme binding</td>\n",
" <td>GO</td>\n",
" <td>19899</td>\n",
" <td>molecular_function</td>\n",
" <td>772</td>\n",
" <td>RELA</td>\n",
" <td>NCBI</td>\n",
" <td>5970</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3553533</th>\n",
" <td>6229</td>\n",
" <td>NOD2</td>\n",
" <td>NCBI</td>\n",
" <td>64127</td>\n",
" <td>gene/protein</td>\n",
" <td>122117</td>\n",
" <td>muramyl dipeptide binding</td>\n",
" <td>GO</td>\n",
" <td>32500</td>\n",
" <td>molecular_function</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3553770</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>115199</td>\n",
" <td>GTP-dependent protein kinase activity</td>\n",
" <td>GO</td>\n",
" <td>34211</td>\n",
" <td>molecular_function</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3553771</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>118105</td>\n",
" <td>beta-catenin destruction complex binding</td>\n",
" <td>GO</td>\n",
" <td>1904713</td>\n",
" <td>molecular_function</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3553773</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>119847</td>\n",
" <td>peroxidase inhibitor activity</td>\n",
" <td>GO</td>\n",
" <td>36479</td>\n",
" <td>molecular_function</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3553832</th>\n",
" <td>22105</td>\n",
" <td>GPBAR1</td>\n",
" <td>NCBI</td>\n",
" <td>151306</td>\n",
" <td>gene/protein</td>\n",
" <td>116806</td>\n",
" <td>G protein-coupled bile acid receptor activity</td>\n",
" <td>GO</td>\n",
" <td>38182</td>\n",
" <td>molecular_function</td>\n",
" <td>interacts with</td>\n",
" <td>molfunc_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1466 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source head_id \\\n",
"6198264 54035 interleukin-1 binding GO 19966 \n",
"6198359 54290 enzyme binding GO 19899 \n",
"6198366 54290 enzyme binding GO 19899 \n",
"6198442 54290 enzyme binding GO 19899 \n",
"6198462 54290 enzyme binding GO 19899 \n",
"... ... ... ... ... \n",
"3553533 6229 NOD2 NCBI 64127 \n",
"3553770 2111 LRRK2 NCBI 120892 \n",
"3553771 2111 LRRK2 NCBI 120892 \n",
"3553773 2111 LRRK2 NCBI 120892 \n",
"3553832 22105 GPBAR1 NCBI 151306 \n",
"\n",
" head_type tail_index \\\n",
"6198264 molecular_function 1654 \n",
"6198359 molecular_function 3578 \n",
"6198366 molecular_function 2057 \n",
"6198442 molecular_function 989 \n",
"6198462 molecular_function 772 \n",
"... ... ... \n",
"3553533 gene/protein 122117 \n",
"3553770 gene/protein 115199 \n",
"3553771 gene/protein 118105 \n",
"3553773 gene/protein 119847 \n",
"3553832 gene/protein 116806 \n",
"\n",
" tail_name tail_source tail_id \\\n",
"6198264 IL1R2 NCBI 7850 \n",
"6198359 ECM1 NCBI 1893 \n",
"6198366 FN1 NCBI 2335 \n",
"6198442 PPARG NCBI 5468 \n",
"6198462 RELA NCBI 5970 \n",
"... ... ... ... \n",
"3553533 muramyl dipeptide binding GO 32500 \n",
"3553770 GTP-dependent protein kinase activity GO 34211 \n",
"3553771 beta-catenin destruction complex binding GO 1904713 \n",
"3553773 peroxidase inhibitor activity GO 36479 \n",
"3553832 G protein-coupled bile acid receptor activity GO 38182 \n",
"\n",
" tail_type display_relation relation \n",
"6198264 gene/protein interacts with molfunc_protein \n",
"6198359 gene/protein interacts with molfunc_protein \n",
"6198366 gene/protein interacts with molfunc_protein \n",
"6198442 gene/protein interacts with molfunc_protein \n",
"6198462 gene/protein interacts with molfunc_protein \n",
"... ... ... ... \n",
"3553533 molecular_function interacts with molfunc_protein \n",
"3553770 molecular_function interacts with molfunc_protein \n",
"3553771 molecular_function interacts with molfunc_protein \n",
"3553773 molecular_function interacts with molfunc_protein \n",
"3553832 molecular_function interacts with molfunc_protein \n",
"\n",
"[1466 rows x 12 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD molfunc_protein edges \n",
"ibd_molfunc_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'molecular_function') & \n",
" (primekg_edges.tail_type == 'gene/protein') & \n",
" (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
" primekg_edges[(primekg_edges.tail_type == 'molecular_function') & \n",
" (primekg_edges.head_type == 'gene/protein') & \n",
" (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
"\n",
"# Check dataframe\n",
"ibd_molfunc_protein_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### CellComp-Protein Relationship"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we are getting the records containing the relationships of cellular_component-gene/protein nodes."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6267848</th>\n",
" <td>126078</td>\n",
" <td>ficolin-1-rich granule lumen</td>\n",
" <td>GO</td>\n",
" <td>1904813</td>\n",
" <td>cellular_component</td>\n",
" <td>3474</td>\n",
" <td>MMP9</td>\n",
" <td>NCBI</td>\n",
" <td>4318</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6268120</th>\n",
" <td>124245</td>\n",
" <td>extracellular space</td>\n",
" <td>GO</td>\n",
" <td>5615</td>\n",
" <td>cellular_component</td>\n",
" <td>2384</td>\n",
" <td>CRP</td>\n",
" <td>NCBI</td>\n",
" <td>1401</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6268163</th>\n",
" <td>124245</td>\n",
" <td>extracellular space</td>\n",
" <td>GO</td>\n",
" <td>5615</td>\n",
" <td>cellular_component</td>\n",
" <td>5805</td>\n",
" <td>DEFA5</td>\n",
" <td>NCBI</td>\n",
" <td>1670</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6268164</th>\n",
" <td>124245</td>\n",
" <td>extracellular space</td>\n",
" <td>GO</td>\n",
" <td>5615</td>\n",
" <td>cellular_component</td>\n",
" <td>657</td>\n",
" <td>DEFA6</td>\n",
" <td>NCBI</td>\n",
" <td>1671</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6268173</th>\n",
" <td>124245</td>\n",
" <td>extracellular space</td>\n",
" <td>GO</td>\n",
" <td>5615</td>\n",
" <td>cellular_component</td>\n",
" <td>3578</td>\n",
" <td>ECM1</td>\n",
" <td>NCBI</td>\n",
" <td>1893</td>\n",
" <td>gene/protein</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3636708</th>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>126261</td>\n",
" <td>muscle cell projection membrane</td>\n",
" <td>GO</td>\n",
" <td>36195</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3636819</th>\n",
" <td>9763</td>\n",
" <td>ORMDL3</td>\n",
" <td>NCBI</td>\n",
" <td>94103</td>\n",
" <td>gene/protein</td>\n",
" <td>126815</td>\n",
" <td>SPOTS complex</td>\n",
" <td>GO</td>\n",
" <td>35339</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3637211</th>\n",
" <td>6661</td>\n",
" <td>ATG16L1</td>\n",
" <td>NCBI</td>\n",
" <td>55054</td>\n",
" <td>gene/protein</td>\n",
" <td>126444</td>\n",
" <td>vacuole-isolation membrane contact site</td>\n",
" <td>GO</td>\n",
" <td>120095</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3637234</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>126938</td>\n",
" <td>cytoplasmic side of mitochondrial outer membrane</td>\n",
" <td>GO</td>\n",
" <td>32473</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3637328</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>125942</td>\n",
" <td>caveola neck</td>\n",
" <td>GO</td>\n",
" <td>99400</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1348 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source head_id \\\n",
"6267848 126078 ficolin-1-rich granule lumen GO 1904813 \n",
"6268120 124245 extracellular space GO 5615 \n",
"6268163 124245 extracellular space GO 5615 \n",
"6268164 124245 extracellular space GO 5615 \n",
"6268173 124245 extracellular space GO 5615 \n",
"... ... ... ... ... \n",
"3636708 2139 ARPC2 NCBI 10109 \n",
"3636819 9763 ORMDL3 NCBI 94103 \n",
"3637211 6661 ATG16L1 NCBI 55054 \n",
"3637234 2111 LRRK2 NCBI 120892 \n",
"3637328 2111 LRRK2 NCBI 120892 \n",
"\n",
" head_type tail_index \\\n",
"6267848 cellular_component 3474 \n",
"6268120 cellular_component 2384 \n",
"6268163 cellular_component 5805 \n",
"6268164 cellular_component 657 \n",
"6268173 cellular_component 3578 \n",
"... ... ... \n",
"3636708 gene/protein 126261 \n",
"3636819 gene/protein 126815 \n",
"3637211 gene/protein 126444 \n",
"3637234 gene/protein 126938 \n",
"3637328 gene/protein 125942 \n",
"\n",
" tail_name tail_source tail_id \\\n",
"6267848 MMP9 NCBI 4318 \n",
"6268120 CRP NCBI 1401 \n",
"6268163 DEFA5 NCBI 1670 \n",
"6268164 DEFA6 NCBI 1671 \n",
"6268173 ECM1 NCBI 1893 \n",
"... ... ... ... \n",
"3636708 muscle cell projection membrane GO 36195 \n",
"3636819 SPOTS complex GO 35339 \n",
"3637211 vacuole-isolation membrane contact site GO 120095 \n",
"3637234 cytoplasmic side of mitochondrial outer membrane GO 32473 \n",
"3637328 caveola neck GO 99400 \n",
"\n",
" tail_type display_relation relation \n",
"6267848 gene/protein interacts with cellcomp_protein \n",
"6268120 gene/protein interacts with cellcomp_protein \n",
"6268163 gene/protein interacts with cellcomp_protein \n",
"6268164 gene/protein interacts with cellcomp_protein \n",
"6268173 gene/protein interacts with cellcomp_protein \n",
"... ... ... ... \n",
"3636708 cellular_component interacts with cellcomp_protein \n",
"3636819 cellular_component interacts with cellcomp_protein \n",
"3637211 cellular_component interacts with cellcomp_protein \n",
"3637234 cellular_component interacts with cellcomp_protein \n",
"3637328 cellular_component interacts with cellcomp_protein \n",
"\n",
"[1348 rows x 12 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# IBD molfunc_protein edges \n",
"ibd_cellcomp_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'cellular_component') & \n",
" (primekg_edges.tail_type == 'gene/protein') & \n",
" (primekg_edges.tail_index.isin(ibd_protein_index))],\n",
" primekg_edges[(primekg_edges.tail_type == 'cellular_component') & \n",
" (primekg_edges.head_type == 'gene/protein') & \n",
" (primekg_edges.head_index.isin(ibd_protein_index))]])\n",
"\n",
"# Check dataframe\n",
"ibd_cellcomp_protein_edges_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Merge all dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have all of particular type of edges, we can merge them into a single dataframe representing a subgraph of IBD inferred from PrimeKG."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" <th>edge_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2712</td>\n",
" <td>CASP3</td>\n",
" <td>NCBI</td>\n",
" <td>836</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12747</th>\n",
" <td>2139</td>\n",
" <td>ARPC2</td>\n",
" <td>NCBI</td>\n",
" <td>10109</td>\n",
" <td>gene/protein</td>\n",
" <td>126261</td>\n",
" <td>muscle cell projection membrane</td>\n",
" <td>GO</td>\n",
" <td>36195</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" <td>(gene/protein, interacts with, cellular_compon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12748</th>\n",
" <td>9763</td>\n",
" <td>ORMDL3</td>\n",
" <td>NCBI</td>\n",
" <td>94103</td>\n",
" <td>gene/protein</td>\n",
" <td>126815</td>\n",
" <td>SPOTS complex</td>\n",
" <td>GO</td>\n",
" <td>35339</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" <td>(gene/protein, interacts with, cellular_compon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12749</th>\n",
" <td>6661</td>\n",
" <td>ATG16L1</td>\n",
" <td>NCBI</td>\n",
" <td>55054</td>\n",
" <td>gene/protein</td>\n",
" <td>126444</td>\n",
" <td>vacuole-isolation membrane contact site</td>\n",
" <td>GO</td>\n",
" <td>120095</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" <td>(gene/protein, interacts with, cellular_compon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12750</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>126938</td>\n",
" <td>cytoplasmic side of mitochondrial outer membrane</td>\n",
" <td>GO</td>\n",
" <td>32473</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" <td>(gene/protein, interacts with, cellular_compon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12751</th>\n",
" <td>2111</td>\n",
" <td>LRRK2</td>\n",
" <td>NCBI</td>\n",
" <td>120892</td>\n",
" <td>gene/protein</td>\n",
" <td>125942</td>\n",
" <td>caveola neck</td>\n",
" <td>GO</td>\n",
" <td>99400</td>\n",
" <td>cellular_component</td>\n",
" <td>interacts with</td>\n",
" <td>cellcomp_protein</td>\n",
" <td>(gene/protein, interacts with, cellular_compon...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>12752 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source \\\n",
"0 37785 ulcerative colitis (disease) MONDO \n",
"1 28158 inflammatory bowel disease MONDO_grouped \n",
"2 37785 ulcerative colitis (disease) MONDO \n",
"3 28158 inflammatory bowel disease MONDO_grouped \n",
"4 37785 ulcerative colitis (disease) MONDO \n",
"... ... ... ... \n",
"12747 2139 ARPC2 NCBI \n",
"12748 9763 ORMDL3 NCBI \n",
"12749 6661 ATG16L1 NCBI \n",
"12750 2111 LRRK2 NCBI \n",
"12751 2111 LRRK2 NCBI \n",
"\n",
" head_id head_type \\\n",
"0 5101 disease \n",
"1 9960_12845_33643_11471_12831_12875_12941_13153... disease \n",
"2 5101 disease \n",
"3 9960_12845_33643_11471_12831_12875_12941_13153... disease \n",
"4 5101 disease \n",
"... ... ... \n",
"12747 10109 gene/protein \n",
"12748 94103 gene/protein \n",
"12749 55054 gene/protein \n",
"12750 120892 gene/protein \n",
"12751 120892 gene/protein \n",
"\n",
" tail_index tail_name \\\n",
"0 7359 ADCY7 \n",
"1 7359 ADCY7 \n",
"2 2874 PRDM1 \n",
"3 2874 PRDM1 \n",
"4 2712 CASP3 \n",
"... ... ... \n",
"12747 126261 muscle cell projection membrane \n",
"12748 126815 SPOTS complex \n",
"12749 126444 vacuole-isolation membrane contact site \n",
"12750 126938 cytoplasmic side of mitochondrial outer membrane \n",
"12751 125942 caveola neck \n",
"\n",
" tail_source tail_id tail_type display_relation \\\n",
"0 NCBI 113 gene/protein associated with \n",
"1 NCBI 113 gene/protein associated with \n",
"2 NCBI 639 gene/protein associated with \n",
"3 NCBI 639 gene/protein associated with \n",
"4 NCBI 836 gene/protein associated with \n",
"... ... ... ... ... \n",
"12747 GO 36195 cellular_component interacts with \n",
"12748 GO 35339 cellular_component interacts with \n",
"12749 GO 120095 cellular_component interacts with \n",
"12750 GO 32473 cellular_component interacts with \n",
"12751 GO 99400 cellular_component interacts with \n",
"\n",
" relation edge_type \n",
"0 disease_protein (disease, associated with, gene/protein) \n",
"1 disease_protein (disease, associated with, gene/protein) \n",
"2 disease_protein (disease, associated with, gene/protein) \n",
"3 disease_protein (disease, associated with, gene/protein) \n",
"4 disease_protein (disease, associated with, gene/protein) \n",
"... ... ... \n",
"12747 cellcomp_protein (gene/protein, interacts with, cellular_compon... \n",
"12748 cellcomp_protein (gene/protein, interacts with, cellular_compon... \n",
"12749 cellcomp_protein (gene/protein, interacts with, cellular_compon... \n",
"12750 cellcomp_protein (gene/protein, interacts with, cellular_compon... \n",
"12751 cellcomp_protein (gene/protein, interacts with, cellular_compon... \n",
"\n",
"[12752 rows x 13 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# PrimeKG edges related to IBD\n",
"primekg_ibd_edges_df = pd.concat([ibd_disease_protein_edges_df,\n",
" # ibd_disease_disease_edges_df,\n",
" # ibd_protein_protein_edges_df,\n",
" ibd_drug_protein_edges_df,\n",
" ibd_pathway_protein_edges_df,\n",
" # ibd_pathway_pathway_edges_df,\n",
" ibd_bioprocess_protein_edges_df,\n",
" ibd_molfunc_protein_edges_df,\n",
" ibd_cellcomp_protein_edges_df])\n",
"primekg_ibd_edges_df[\"edge_type\"] = primekg_ibd_edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)\n",
"primekg_ibd_edges_df.drop_duplicates(subset=['head_index', 'tail_index'], inplace=True)\n",
"primekg_ibd_edges_df.reset_index(drop=True, inplace=True)\n",
"primekg_ibd_edges_df"
]
},
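{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `drop_duplicates(subset=['head_index', 'tail_index'])` treats each pair as ordered: an edge and its reverse are both kept, while a second relation between the same head and tail is dropped in favor of the first one encountered. The toy example below illustrates this behavior as a minimal sketch; the indices and relation labels are made up and do not come from PrimeKG.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical edges: rows 0 and 2 share the ordered pair (1, 2), row 1 is the reverse pair\n",
"edges = pd.DataFrame({\n",
"    'head_index': [1, 2, 1, 1],\n",
"    'tail_index': [2, 1, 2, 3],\n",
"    'display_relation': ['target', 'target', 'enzyme', 'target'],\n",
"})\n",
"\n",
"# Keep the first row for every ordered (head_index, tail_index) pair\n",
"deduped = edges.drop_duplicates(subset=['head_index', 'tail_index']).reset_index(drop=True)\n",
"print(deduped)\n",
"```\n",
"\n",
"If multiple relations between the same pair of nodes need to be preserved, one option would be to add `display_relation` to the `subset` list."
]
},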
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get a dataframe of nodes based on the above edge dataframe as follows:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_index</th>\n",
" <th>node_name</th>\n",
" <th>node_source</th>\n",
" <th>node_id</th>\n",
" <th>node_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>144</th>\n",
" <td>144</td>\n",
" <td>SMAD3</td>\n",
" <td>NCBI</td>\n",
" <td>4088</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179</th>\n",
" <td>179</td>\n",
" <td>IL10RB</td>\n",
" <td>NCBI</td>\n",
" <td>3588</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>192</td>\n",
" <td>GNA12</td>\n",
" <td>NCBI</td>\n",
" <td>2768</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279</th>\n",
" <td>279</td>\n",
" <td>HNF4A</td>\n",
" <td>NCBI</td>\n",
" <td>3172</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>417</th>\n",
" <td>417</td>\n",
" <td>VCAM1</td>\n",
" <td>NCBI</td>\n",
" <td>7412</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129360</th>\n",
" <td>129360</td>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-975163</td>\n",
" <td>pathway</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129361</th>\n",
" <td>129361</td>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-975110</td>\n",
" <td>pathway</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129365</th>\n",
" <td>129365</td>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983168</td>\n",
" <td>pathway</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129366</th>\n",
" <td>129366</td>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983170</td>\n",
" <td>pathway</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129367</th>\n",
" <td>129367</td>\n",
" <td>Kinesins</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983189</td>\n",
" <td>pathway</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3426 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" node_index node_name \\\n",
"144 144 SMAD3 \n",
"179 179 IL10RB \n",
"192 192 GNA12 \n",
"279 279 HNF4A \n",
"417 417 VCAM1 \n",
"... ... ... \n",
"129360 129360 IRAK2 mediated activation of TAK1 complex upon... \n",
"129361 129361 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"129365 129365 Antigen processing: Ubiquitination & Proteasom... \n",
"129366 129366 Antigen Presentation: Folding, assembly and pe... \n",
"129367 129367 Kinesins \n",
"\n",
" node_source node_id node_type \n",
"144 NCBI 4088 gene/protein \n",
"179 NCBI 3588 gene/protein \n",
"192 NCBI 2768 gene/protein \n",
"279 NCBI 3172 gene/protein \n",
"417 NCBI 7412 gene/protein \n",
"... ... ... ... \n",
"129360 REACTOME R-HSA-975163 pathway \n",
"129361 REACTOME R-HSA-975110 pathway \n",
"129365 REACTOME R-HSA-983168 pathway \n",
"129366 REACTOME R-HSA-983170 pathway \n",
"129367 REACTOME R-HSA-983189 pathway \n",
"\n",
"[3426 rows x 5 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# PrimeKG nodes related to IBD\n",
"primekg_ibd_nodes_df = primekg_nodes[primekg_nodes.index.isin(np.unique(np.hstack([primekg_ibd_edges_df.head_index.unique(), \n",
" primekg_ibd_edges_df.tail_index.unique()])))]\n",
"primekg_ibd_nodes_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can store the nodes and edges related to IBD in a parquet file for future use."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Store the IBD-related nodes and edges\n",
"local_dir = '../../../../data/primekg_ibd/'\n",
"if not os.path.exists(local_dir):\n",
" os.makedirs(local_dir)\n",
"primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes.parquet'), compression='gzip', index=False)\n",
"primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges.parquet'), compression='gzip', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of IBD-related nodes: 3426\n",
"Number of IBD-related edges: 12752\n"
]
}
],
"source": [
"# Statistics over the IBD-related nodes and edges\n",
"print(f\"Number of IBD-related nodes: {primekg_ibd_nodes_df.shape[0]}\")\n",
"print(f\"Number of IBD-related edges: {primekg_ibd_edges_df.shape[0]}\")"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"node_type\n",
"biological_process 1642\n",
"cellular_component 207\n",
"disease 7\n",
"drug 835\n",
"gene/protein 103\n",
"molecular_function 324\n",
"pathway 308\n",
"dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count the number of nodes by node type\n",
"primekg_ibd_nodes_df.groupby('node_type').size()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"relation display_relation\n",
"bioprocess_protein interacts with 6300\n",
"cellcomp_protein interacts with 1348\n",
"disease_protein associated with 620\n",
"drug_protein carrier 8\n",
" enzyme 64\n",
" target 776\n",
" transporter 1140\n",
"molfunc_protein interacts with 1466\n",
"pathway_protein interacts with 1030\n",
"dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count the number of edges by relation and display_relation\n",
"primekg_ibd_edges_df.groupby(['relation','display_relation']).size()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"edge_type\n",
"(biological_process, interacts with, gene/protein) 3150\n",
"(cellular_component, interacts with, gene/protein) 674\n",
"(disease, associated with, gene/protein) 310\n",
"(drug, carrier, gene/protein) 4\n",
"(drug, enzyme, gene/protein) 32\n",
"(drug, target, gene/protein) 388\n",
"(drug, transporter, gene/protein) 570\n",
"(gene/protein, associated with, disease) 310\n",
"(gene/protein, carrier, drug) 4\n",
"(gene/protein, enzyme, drug) 32\n",
"(gene/protein, interacts with, biological_process) 3150\n",
"(gene/protein, interacts with, cellular_component) 674\n",
"(gene/protein, interacts with, molecular_function) 733\n",
"(gene/protein, interacts with, pathway) 515\n",
"(gene/protein, target, drug) 388\n",
"(gene/protein, transporter, drug) 570\n",
"(molecular_function, interacts with, gene/protein) 733\n",
"(pathway, interacts with, gene/protein) 515\n",
"dtype: int64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count the number of edges by edge type\n",
"primekg_ibd_edges_df.groupby(['edge_type']).size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enrichment (using textual as of now)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this point onwards, we will use the pre-processed IBD-related nodes and edges to create a set of graph formats.\n",
"\n",
"Before that, we should perform enrichment and embedding over the IBD-related nodes and edges.\n",
"\n",
"As of now, we will conduct a textual enrichment over the records.\n",
"\n",
"Since StarQA provide most of information of the nodes, we will use StarkQA to get the information of the nodes related to IBD."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading StarkQAPrimeKG dataset...\n",
"../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.\n",
"Loading StarkQAPrimeKG embeddings...\n"
]
}
],
"source": [
"# Define starkqa primekg data by providing a local directory where the data is stored\n",
"starkqa_data = StarkQAPrimeKG(local_dir=\"../../../../data/starkqa_primekg/\")\n",
"\n",
"# Invoke a method to load the data\n",
"starkqa_data.load_data()\n",
"\n",
"# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information\n",
"# starkqa_df = starkqa_data.get_starkqa()\n",
"starkqa_node_info = starkqa_data.get_starkqa_node_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that not all nodes in the StarkQA-PrimeKG have additional information. \n",
"\n",
"For this case, we provide a basic text enrichment for the nodes by simply specifying their node name and type."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"def do_enrichment_text(data, starkqa_node_info):\n",
" \"\"\"\n",
" Enrich the node with additional textual information from BioBridge and StarkQA.\n",
"\n",
" Args:\n",
" data (dict): The node data from PrimeKG\n",
" starkqa_node_info (dict): The node information from StarkQA-PrimeKG\n",
" \"\"\"\n",
" # Basic textual enrichment of the node\n",
" enriched_node = f\"{data['node_name']} belongs to {data['node_type']} category. \"\n",
"\n",
" # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which\n",
" # has additional information in the node_info of StarkQA-PrimeKG\n",
" added_info = ''\n",
" if data['node_type'] == 'gene/protein':\n",
" added_info = starkqa_node_info['details']['summary'] if 'summary' in starkqa_node_info['details'] else ''\n",
" elif data['node_type'] == 'drug':\n",
" added_info = ' '.join([str(starkqa_node_info['details']['description']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['mechanism_of_action']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['protein_binding']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['pharmacodynamics']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['indication']).replace('nan', '')])\n",
" elif data['node_type'] == 'disease':\n",
" added_info = ' '.join([str(starkqa_node_info['details']['mondo_definition']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['mayo_symptoms']).replace('nan', ''),\n",
" str(starkqa_node_info['details']['mayo_causes']).replace('nan', '')])\n",
" elif data['node_type'] == 'pathway':\n",
" added_info += f\"This pathway found in {starkqa_node_info['details']['speciesName']}. \" + ' '.join([x['text'] for x in starkqa_node_info['details']['summation']]) if 'details' in starkqa_node_info else ''\n",
"\n",
" # Append the additional information for enrichment\n",
" enriched_node += added_info\n",
" return enriched_node"
]
},
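{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every node receives the basic sentence built from `node_name` and `node_type`, and extra text is appended only for the supported node types. The sketch below reproduces just that fallback logic on two made-up records. `enrich_basic`, the sample nodes, and the sample `details` dictionary are hypothetical and only stand in for real PrimeKG and StarkQA entries.\n",
"\n",
"```python\n",
"def enrich_basic(node, details):\n",
"    # Basic sentence that every node receives, regardless of its type\n",
"    text = node['node_name'] + ' belongs to ' + node['node_type'] + ' category. '\n",
"    # Only gene/protein nodes are handled in this sketch; other types keep the basic sentence\n",
"    if node['node_type'] == 'gene/protein':\n",
"        text += details.get('summary', '')\n",
"    return text\n",
"\n",
"gene = {'node_name': 'SMAD3', 'node_type': 'gene/protein'}\n",
"process = {'node_name': 'autophagy', 'node_type': 'biological_process'}\n",
"\n",
"gene_text = enrich_basic(gene, {'summary': 'Example summary text.'})\n",
"process_text = enrich_basic(process, {})\n",
"print(gene_text)\n",
"print(process_text)\n",
"```\n",
"\n",
"Using `dict.get` with a default is one way to avoid a `KeyError` when a field is missing; the function above instead checks for the key explicitly."
]
},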
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using the above function, we can enrich the node information from PrimeKG with additional information from StarkQA-PrimeKG as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_64662/2873064541.py:3: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_index</th>\n",
" <th>node_name</th>\n",
" <th>node_source</th>\n",
" <th>node_id</th>\n",
" <th>node_type</th>\n",
" <th>enriched_node</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>144</th>\n",
" <td>144</td>\n",
" <td>SMAD3</td>\n",
" <td>NCBI</td>\n",
" <td>4088</td>\n",
" <td>gene/protein</td>\n",
" <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179</th>\n",
" <td>179</td>\n",
" <td>IL10RB</td>\n",
" <td>NCBI</td>\n",
" <td>3588</td>\n",
" <td>gene/protein</td>\n",
" <td>IL10RB belongs to gene/protein category. The p...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>192</td>\n",
" <td>GNA12</td>\n",
" <td>NCBI</td>\n",
" <td>2768</td>\n",
" <td>gene/protein</td>\n",
" <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279</th>\n",
" <td>279</td>\n",
" <td>HNF4A</td>\n",
" <td>NCBI</td>\n",
" <td>3172</td>\n",
" <td>gene/protein</td>\n",
" <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>417</th>\n",
" <td>417</td>\n",
" <td>VCAM1</td>\n",
" <td>NCBI</td>\n",
" <td>7412</td>\n",
" <td>gene/protein</td>\n",
" <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129360</th>\n",
" <td>129360</td>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-975163</td>\n",
" <td>pathway</td>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129361</th>\n",
" <td>129361</td>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-975110</td>\n",
" <td>pathway</td>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129365</th>\n",
" <td>129365</td>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983168</td>\n",
" <td>pathway</td>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129366</th>\n",
" <td>129366</td>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983170</td>\n",
" <td>pathway</td>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129367</th>\n",
" <td>129367</td>\n",
" <td>Kinesins</td>\n",
" <td>REACTOME</td>\n",
" <td>R-HSA-983189</td>\n",
" <td>pathway</td>\n",
" <td>Kinesins belongs to pathway category. This pat...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3426 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" node_index node_name \\\n",
"144 144 SMAD3 \n",
"179 179 IL10RB \n",
"192 192 GNA12 \n",
"279 279 HNF4A \n",
"417 417 VCAM1 \n",
"... ... ... \n",
"129360 129360 IRAK2 mediated activation of TAK1 complex upon... \n",
"129361 129361 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"129365 129365 Antigen processing: Ubiquitination & Proteasom... \n",
"129366 129366 Antigen Presentation: Folding, assembly and pe... \n",
"129367 129367 Kinesins \n",
"\n",
" node_source node_id node_type \\\n",
"144 NCBI 4088 gene/protein \n",
"179 NCBI 3588 gene/protein \n",
"192 NCBI 2768 gene/protein \n",
"279 NCBI 3172 gene/protein \n",
"417 NCBI 7412 gene/protein \n",
"... ... ... ... \n",
"129360 REACTOME R-HSA-975163 pathway \n",
"129361 REACTOME R-HSA-975110 pathway \n",
"129365 REACTOME R-HSA-983168 pathway \n",
"129366 REACTOME R-HSA-983170 pathway \n",
"129367 REACTOME R-HSA-983189 pathway \n",
"\n",
" enriched_node \n",
"144 SMAD3 belongs to gene/protein category. The SM... \n",
"179 IL10RB belongs to gene/protein category. The p... \n",
"192 GNA12 belongs to gene/protein category. Predic... \n",
"279 HNF4A belongs to gene/protein category. The pr... \n",
"417 VCAM1 belongs to gene/protein category. This g... \n",
"... ... \n",
"129360 IRAK2 mediated activation of TAK1 complex upon... \n",
"129361 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"129365 Antigen processing: Ubiquitination & Proteasom... \n",
"129366 Antigen Presentation: Folding, assembly and pe... \n",
"129367 Kinesins belongs to pathway category. This pat... \n",
"\n",
"[3426 rows x 6 columns]"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Perform node enrichment for each row in primekg_nodes\n",
"text_enriched_nodes = primekg_ibd_nodes_df.apply(lambda x: do_enrichment_text(x, starkqa_node_info[x['node_index']]), axis=1).tolist()\n",
"primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes\n",
"primekg_ibd_nodes_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Subsequently, we can perform similar textual enrichment for the edges in PrimeKG.\n",
"\n",
"Since StarkQA only provides node information, we can only enrich the edges with basic information of the triples in combination with the head and tail nodes."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" <th>edge_type</th>\n",
" <th>enriched_edge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>NCBI</td>\n",
" <td>113</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>MONDO_grouped</td>\n",
" <td>9960_12845_33643_11471_12831_12875_12941_13153...</td>\n",
" <td>disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>NCBI</td>\n",
" <td>639</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>MONDO</td>\n",
" <td>5101</td>\n",
" <td>disease</td>\n",
" <td>2712</td>\n",
" <td>CASP3</td>\n",
" <td>NCBI</td>\n",
" <td>836</td>\n",
" <td>gene/protein</td>\n",
" <td>associated with</td>\n",
" <td>disease_protein</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source \\\n",
"0 37785 ulcerative colitis (disease) MONDO \n",
"1 28158 inflammatory bowel disease MONDO_grouped \n",
"2 37785 ulcerative colitis (disease) MONDO \n",
"3 28158 inflammatory bowel disease MONDO_grouped \n",
"4 37785 ulcerative colitis (disease) MONDO \n",
"\n",
" head_id head_type tail_index \\\n",
"0 5101 disease 7359 \n",
"1 9960_12845_33643_11471_12831_12875_12941_13153... disease 7359 \n",
"2 5101 disease 2874 \n",
"3 9960_12845_33643_11471_12831_12875_12941_13153... disease 2874 \n",
"4 5101 disease 2712 \n",
"\n",
" tail_name tail_source tail_id tail_type display_relation \\\n",
"0 ADCY7 NCBI 113 gene/protein associated with \n",
"1 ADCY7 NCBI 113 gene/protein associated with \n",
"2 PRDM1 NCBI 639 gene/protein associated with \n",
"3 PRDM1 NCBI 639 gene/protein associated with \n",
"4 CASP3 NCBI 836 gene/protein associated with \n",
"\n",
" relation edge_type \\\n",
"0 disease_protein (disease, associated with, gene/protein) \n",
"1 disease_protein (disease, associated with, gene/protein) \n",
"2 disease_protein (disease, associated with, gene/protein) \n",
"3 disease_protein (disease, associated with, gene/protein) \n",
"4 disease_protein (disease, associated with, gene/protein) \n",
"\n",
" enriched_edge \n",
"0 ulcerative colitis (disease) (disease) has a d... \n",
"1 inflammatory bowel disease (disease) has a dir... \n",
"2 ulcerative colitis (disease) (disease) has a d... \n",
"3 inflammatory bowel disease (disease) has a dir... \n",
"4 ulcerative colitis (disease) (disease) has a d... "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Perform textual enrichment over the edges by simply concatenating the head and tail nodes with the relation followed by the enriched node information\n",
"text_enriched_edges = primekg_ibd_edges_df.apply(lambda x: f\"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).\", axis=1).tolist()\n",
"primekg_ibd_edges_df['enriched_edge'] = text_enriched_edges\n",
"primekg_ibd_edges_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Embeddings (using textual embedding as of now)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to perform embedding using the enriched nodes and edges by leveraging `EmbeddingWithOllama` class.\n",
"\n",
"For this purpose, we will use `nomic-embed-text`."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"# Using nomic-ai/nomic-embed-text-v1.5 model\n",
"emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Node Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will perform node embedding for the IBD-related nodes using the Ollama model by using mini-batches of 100 nodes at a time."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 35/35 [00:19<00:00, 1.75it/s]\n"
]
}
],
"source": [
"# There are many node records, so we embed them in mini-batches\n",
"mini_batch_size = 100\n",
"node_embeddings = []\n",
"for i in tqdm(range(0, primekg_ibd_nodes_df.shape[0], mini_batch_size)):\n",
" outputs = emb_model.embed_documents(primekg_ibd_nodes_df.enriched_node.values.tolist()[i:i+mini_batch_size])\n",
" node_embeddings.extend(outputs)\n",
"# node_embeddings"
]
},
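{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mini-batching loop above can be factored into a small reusable helper. The following is a minimal sketch rather than part of the original pipeline: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for `emb_model.embed_documents` so that the cell runs without an Ollama server."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def embed_in_batches(texts, embed_fn, batch_size=100):\n",
"    \"\"\"Embed texts in consecutive mini-batches and return one vector per text.\"\"\"\n",
"    embeddings = []\n",
"    for start in range(0, len(texts), batch_size):\n",
"        batch = list(texts[start:start + batch_size])\n",
"        embeddings.extend(embed_fn(batch))\n",
"    return embeddings\n",
"\n",
"def fake_embed(batch):\n",
"    # Hypothetical stand-in for emb_model.embed_documents: one 2-d vector per text\n",
"    return [[float(len(text)), 0.0] for text in batch]\n",
"\n",
"vectors = embed_in_batches(['a', 'bb', 'ccc'], fake_embed, batch_size=2)\n",
"assert len(vectors) == 3  # one embedding per input text, in input order\n",
"\n",
"# With the notebook's objects, the equivalent call would be:\n",
"# node_embeddings = embed_in_batches(primekg_ibd_nodes_df.enriched_node.tolist(), emb_model.embed_documents)"
]
},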
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3426, 768)"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the shape of the node embeddings\n",
"len(node_embeddings), len(node_embeddings[0])"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_64662/3470083233.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df['x'] = node_embeddings\n",
"/tmp/ipykernel_64662/3470083233.py:5: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)\n",
"/tmp/ipykernel_64662/3470083233.py:6: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_id</th>\n",
" <th>node_name</th>\n",
" <th>node_type</th>\n",
" <th>enriched_node</th>\n",
" <th>x</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>144</th>\n",
" <td>144</td>\n",
" <td>SMAD3</td>\n",
" <td>gene/protein</td>\n",
" <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
" <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179</th>\n",
" <td>179</td>\n",
" <td>IL10RB</td>\n",
" <td>gene/protein</td>\n",
" <td>IL10RB belongs to gene/protein category. The p...</td>\n",
" <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>192</td>\n",
" <td>GNA12</td>\n",
" <td>gene/protein</td>\n",
" <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
" <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279</th>\n",
" <td>279</td>\n",
" <td>HNF4A</td>\n",
" <td>gene/protein</td>\n",
" <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
" <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>417</th>\n",
" <td>417</td>\n",
" <td>VCAM1</td>\n",
" <td>gene/protein</td>\n",
" <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
" <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node_id node_name node_type \\\n",
"144 144 SMAD3 gene/protein \n",
"179 179 IL10RB gene/protein \n",
"192 192 GNA12 gene/protein \n",
"279 279 HNF4A gene/protein \n",
"417 417 VCAM1 gene/protein \n",
"\n",
" enriched_node \\\n",
"144 SMAD3 belongs to gene/protein category. The SM... \n",
"179 IL10RB belongs to gene/protein category. The p... \n",
"192 GNA12 belongs to gene/protein category. Predic... \n",
"279 HNF4A belongs to gene/protein category. The pr... \n",
"417 VCAM1 belongs to gene/protein category. This g... \n",
"\n",
" x \n",
"144 [0.026536005, 0.05420931, -0.17033643, -0.0248... \n",
"179 [0.024764946, 0.022782002, -0.16956052, -0.033... \n",
"192 [0.004795947, 0.04921528, -0.14488313, -0.0492... \n",
"279 [0.013905027, 0.032602787, -0.15260702, 0.0074... \n",
"417 [0.047299746, 0.032621186, -0.15677826, -0.021... "
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Add them as features to the dataframe\n",
"primekg_ibd_nodes_df['x'] = node_embeddings\n",
"\n",
"# Drop and rename several columns\n",
"primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)\n",
"primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)\n",
"\n",
"# Check dataframe of nodes\n",
"primekg_ibd_nodes_df.head()"
]
},
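{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `SettingWithCopyWarning` messages above suggest that `primekg_ibd_nodes_df` was obtained by slicing another dataframe. A minimal sketch of the usual remedy, taking an explicit `.copy()` before modifying the slice, is shown below on a hypothetical toy dataframe (`all_nodes` and `subset` are illustrative names, not objects from this notebook)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the full node table\n",
"all_nodes = pd.DataFrame({\n",
"    'node_index': [144, 179, 192],\n",
"    'node_name': ['SMAD3', 'IL10RB', 'GNA12'],\n",
"    'node_type': ['gene/protein', 'gene/protein', 'disease'],\n",
"})\n",
"\n",
"# An explicit copy makes the subset an independent dataframe, so later\n",
"# column assignments do not raise SettingWithCopyWarning\n",
"subset = all_nodes[all_nodes['node_type'] == 'gene/protein'].copy()\n",
"subset['x'] = [[0.1, 0.2], [0.3, 0.4]]\n",
"subset = subset.rename(columns={'node_index': 'node_id'})\n",
"\n",
"assert 'x' not in all_nodes.columns  # the source dataframe is untouched"
]
},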
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_64662/1471123717.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_id</th>\n",
" <th>node_name</th>\n",
" <th>node_type</th>\n",
" <th>enriched_node</th>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <th>node</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>144</th>\n",
" <td>144</td>\n",
" <td>SMAD3</td>\n",
" <td>gene/protein</td>\n",
" <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
" <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>179</th>\n",
" <td>179</td>\n",
" <td>IL10RB</td>\n",
" <td>gene/protein</td>\n",
" <td>IL10RB belongs to gene/protein category. The p...</td>\n",
" <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192</th>\n",
" <td>192</td>\n",
" <td>GNA12</td>\n",
" <td>gene/protein</td>\n",
" <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
" <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>279</th>\n",
" <td>279</td>\n",
" <td>HNF4A</td>\n",
" <td>gene/protein</td>\n",
" <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
" <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>417</th>\n",
" <td>417</td>\n",
" <td>VCAM1</td>\n",
" <td>gene/protein</td>\n",
" <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
" <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node_id node_name node_type \\\n",
"node \n",
"144 144 SMAD3 gene/protein \n",
"179 179 IL10RB gene/protein \n",
"192 192 GNA12 gene/protein \n",
"279 279 HNF4A gene/protein \n",
"417 417 VCAM1 gene/protein \n",
"\n",
" enriched_node \\\n",
"node \n",
"144 SMAD3 belongs to gene/protein category. The SM... \n",
"179 IL10RB belongs to gene/protein category. The p... \n",
"192 GNA12 belongs to gene/protein category. Predic... \n",
"279 HNF4A belongs to gene/protein category. The pr... \n",
"417 VCAM1 belongs to gene/protein category. This g... \n",
"\n",
" x \n",
"node \n",
"144 [0.026536005, 0.05420931, -0.17033643, -0.0248... \n",
"179 [0.024764946, 0.022782002, -0.16956052, -0.033... \n",
"192 [0.004795947, 0.04921528, -0.14488313, -0.0492... \n",
"279 [0.013905027, 0.032602787, -0.15260702, 0.0074... \n",
"417 [0.047299746, 0.032621186, -0.15677826, -0.021... "
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Duplicate node_id into a new 'node' column and use it as the index\n",
"primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']\n",
"primekg_ibd_nodes_df.set_index('node', inplace=True)\n",
"primekg_ibd_nodes_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# Save the embedded node dataframe to a Parquet file (index=False omits the 'node' index, which duplicates node_id)\n",
"primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes_embedded.parquet'), compression='gzip', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Edge Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Likewise, we compute edge embeddings for the IBD-related edges with the Ollama model, processing mini-batches of 100 edges at a time."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 128/128 [00:48<00:00, 2.64it/s]\n"
]
}
],
"source": [
"# There are many edge records, so we embed them in mini-batches\n",
"mini_batch_size = 100\n",
"edge_embeddings = []\n",
"for i in tqdm(range(0, primekg_ibd_edges_df.shape[0], mini_batch_size)):\n",
" outputs = emb_model.embed_documents(primekg_ibd_edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])\n",
" edge_embeddings.extend(outputs)\n",
"# edge_embeddings"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(12752, 768)"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the shape of the edge embeddings\n",
"len(edge_embeddings), len(edge_embeddings[0])"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_id</th>\n",
" <th>head_name</th>\n",
" <th>tail_id</th>\n",
" <th>tail_name</th>\n",
" <th>edge_type</th>\n",
" <th>enriched_edge</th>\n",
" <th>edge_attr</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.061832674, 0.040013667, -0.15366873, -0.008...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>7359</td>\n",
" <td>ADCY7</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" <td>[0.050393466, 0.030410834, -0.15008788, -0.013...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.0401622, 0.028982995, -0.15433805, 0.006565...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>28158</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>2874</td>\n",
" <td>PRDM1</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" <td>[0.02781422, 0.01603875, -0.14870141, 0.004470...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>37785</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>2712</td>\n",
" <td>CASP3</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.07853663, 0.050751355, -0.1470567, -0.01237...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" head_id head_name tail_id tail_name \\\n",
"0 37785 ulcerative colitis (disease) 7359 ADCY7 \n",
"1 28158 inflammatory bowel disease 7359 ADCY7 \n",
"2 37785 ulcerative colitis (disease) 2874 PRDM1 \n",
"3 28158 inflammatory bowel disease 2874 PRDM1 \n",
"4 37785 ulcerative colitis (disease) 2712 CASP3 \n",
"\n",
" edge_type \\\n",
"0 (disease, associated with, gene/protein) \n",
"1 (disease, associated with, gene/protein) \n",
"2 (disease, associated with, gene/protein) \n",
"3 (disease, associated with, gene/protein) \n",
"4 (disease, associated with, gene/protein) \n",
"\n",
" enriched_edge \\\n",
"0 ulcerative colitis (disease) (disease) has a d... \n",
"1 inflammatory bowel disease (disease) has a dir... \n",
"2 ulcerative colitis (disease) (disease) has a d... \n",
"3 inflammatory bowel disease (disease) has a dir... \n",
"4 ulcerative colitis (disease) (disease) has a d... \n",
"\n",
" edge_attr \n",
"0 [0.061832674, 0.040013667, -0.15366873, -0.008... \n",
"1 [0.050393466, 0.030410834, -0.15008788, -0.013... \n",
"2 [0.0401622, 0.028982995, -0.15433805, 0.006565... \n",
"3 [0.02781422, 0.01603875, -0.14870141, 0.004470... \n",
"4 [0.07853663, 0.050751355, -0.1470567, -0.01237... "
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Add them as features to the dataframe\n",
"primekg_ibd_edges_df['edge_attr'] = edge_embeddings\n",
"\n",
"# Drop and rename several columns\n",
"primekg_ibd_edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'display_relation', 'relation'], inplace=True)\n",
"primekg_ibd_edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)\n",
"\n",
"# Check dataframe of edges\n",
"primekg_ibd_edges_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"# Save the embedded edge dataframe to a Parquet file\n",
"primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges_embedded.parquet'), compression='gzip', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Knowledge Graph Construction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we convert our dataframes into a networkx `DiGraph` object."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_64662/4233198491.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df[\"node\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
"/tmp/ipykernel_64662/4233198491.py:3: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" primekg_ibd_nodes_df[\"node_id\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_id</th>\n",
" <th>node_name</th>\n",
" <th>node_type</th>\n",
" <th>enriched_node</th>\n",
" <th>x</th>\n",
" </tr>\n",
" <tr>\n",
" <th>node</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>SMAD3_(144)</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>SMAD3</td>\n",
" <td>gene/protein</td>\n",
" <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
" <td>[0.026536005, 0.05420931, -0.17033643, -0.0248...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IL10RB_(179)</th>\n",
" <td>IL10RB_(179)</td>\n",
" <td>IL10RB</td>\n",
" <td>gene/protein</td>\n",
" <td>IL10RB belongs to gene/protein category. The p...</td>\n",
" <td>[0.024764946, 0.022782002, -0.16956052, -0.033...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GNA12_(192)</th>\n",
" <td>GNA12_(192)</td>\n",
" <td>GNA12</td>\n",
" <td>gene/protein</td>\n",
" <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
" <td>[0.004795947, 0.04921528, -0.14488313, -0.0492...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>HNF4A_(279)</th>\n",
" <td>HNF4A_(279)</td>\n",
" <td>HNF4A</td>\n",
" <td>gene/protein</td>\n",
" <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
" <td>[0.013905027, 0.032602787, -0.15260702, 0.0074...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>VCAM1_(417)</th>\n",
" <td>VCAM1_(417)</td>\n",
" <td>VCAM1</td>\n",
" <td>gene/protein</td>\n",
" <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
" <td>[0.047299746, 0.032621186, -0.15677826, -0.021...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node_id node_name node_type \\\n",
"node \n",
"SMAD3_(144) SMAD3_(144) SMAD3 gene/protein \n",
"IL10RB_(179) IL10RB_(179) IL10RB gene/protein \n",
"GNA12_(192) GNA12_(192) GNA12 gene/protein \n",
"HNF4A_(279) HNF4A_(279) HNF4A gene/protein \n",
"VCAM1_(417) VCAM1_(417) VCAM1 gene/protein \n",
"\n",
" enriched_node \\\n",
"node \n",
"SMAD3_(144) SMAD3 belongs to gene/protein category. The SM... \n",
"IL10RB_(179) IL10RB belongs to gene/protein category. The p... \n",
"GNA12_(192) GNA12 belongs to gene/protein category. Predic... \n",
"HNF4A_(279) HNF4A belongs to gene/protein category. The pr... \n",
"VCAM1_(417) VCAM1 belongs to gene/protein category. This g... \n",
"\n",
" x \n",
"node \n",
"SMAD3_(144) [0.026536005, 0.05420931, -0.17033643, -0.0248... \n",
"IL10RB_(179) [0.024764946, 0.022782002, -0.16956052, -0.033... \n",
"GNA12_(192) [0.004795947, 0.04921528, -0.14488313, -0.0492... \n",
"HNF4A_(279) [0.013905027, 0.032602787, -0.15260702, 0.0074... \n",
"VCAM1_(417) [0.047299746, 0.032621186, -0.15677826, -0.021... "
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build readable node identifiers of the form '<node_name>_(<node_id>)' and use them as the index\n",
"primekg_ibd_nodes_df[\"node\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
"primekg_ibd_nodes_df[\"node_id\"] = primekg_ibd_nodes_df.apply(lambda x: f\"{x.node_name}_({x.node_id})\", axis=1)\n",
"primekg_ibd_nodes_df.set_index('node', inplace=True)\n",
"primekg_ibd_nodes_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_id</th>\n",
" <th>head_name</th>\n",
" <th>tail_id</th>\n",
" <th>tail_name</th>\n",
" <th>edge_type</th>\n",
" <th>enriched_edge</th>\n",
" <th>edge_attr</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ulcerative colitis (disease)_(37785)</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>ADCY7_(7359)</td>\n",
" <td>ADCY7</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.061832674, 0.040013667, -0.15366873, -0.008...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>inflammatory bowel disease_(28158)</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>ADCY7_(7359)</td>\n",
" <td>ADCY7</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" <td>[0.050393466, 0.030410834, -0.15008788, -0.013...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ulcerative colitis (disease)_(37785)</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>PRDM1_(2874)</td>\n",
" <td>PRDM1</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.0401622, 0.028982995, -0.15433805, 0.006565...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>inflammatory bowel disease_(28158)</td>\n",
" <td>inflammatory bowel disease</td>\n",
" <td>PRDM1_(2874)</td>\n",
" <td>PRDM1</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>inflammatory bowel disease (disease) has a dir...</td>\n",
" <td>[0.02781422, 0.01603875, -0.14870141, 0.004470...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ulcerative colitis (disease)_(37785)</td>\n",
" <td>ulcerative colitis (disease)</td>\n",
" <td>CASP3_(2712)</td>\n",
" <td>CASP3</td>\n",
" <td>(disease, associated with, gene/protein)</td>\n",
" <td>ulcerative colitis (disease) (disease) has a d...</td>\n",
" <td>[0.07853663, 0.050751355, -0.1470567, -0.01237...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" head_id head_name \\\n",
"0 ulcerative colitis (disease)_(37785) ulcerative colitis (disease) \n",
"1 inflammatory bowel disease_(28158) inflammatory bowel disease \n",
"2 ulcerative colitis (disease)_(37785) ulcerative colitis (disease) \n",
"3 inflammatory bowel disease_(28158) inflammatory bowel disease \n",
"4 ulcerative colitis (disease)_(37785) ulcerative colitis (disease) \n",
"\n",
" tail_id tail_name edge_type \\\n",
"0 ADCY7_(7359) ADCY7 (disease, associated with, gene/protein) \n",
"1 ADCY7_(7359) ADCY7 (disease, associated with, gene/protein) \n",
"2 PRDM1_(2874) PRDM1 (disease, associated with, gene/protein) \n",
"3 PRDM1_(2874) PRDM1 (disease, associated with, gene/protein) \n",
"4 CASP3_(2712) CASP3 (disease, associated with, gene/protein) \n",
"\n",
" enriched_edge \\\n",
"0 ulcerative colitis (disease) (disease) has a d... \n",
"1 inflammatory bowel disease (disease) has a dir... \n",
"2 ulcerative colitis (disease) (disease) has a d... \n",
"3 inflammatory bowel disease (disease) has a dir... \n",
"4 ulcerative colitis (disease) (disease) has a d... \n",
"\n",
" edge_attr \n",
"0 [0.061832674, 0.040013667, -0.15366873, -0.008... \n",
"1 [0.050393466, 0.030410834, -0.15008788, -0.013... \n",
"2 [0.0401622, 0.028982995, -0.15433805, 0.006565... \n",
"3 [0.02781422, 0.01603875, -0.14870141, 0.004470... \n",
"4 [0.07853663, 0.050751355, -0.1470567, -0.01237... "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build the matching head and tail identifiers of the form '<name>_(<id>)' in the edge dataframe\n",
"primekg_ibd_edges_df[\"head_id\"] = primekg_ibd_edges_df.apply(lambda x: f\"{x.head_name}_({x.head_id})\", axis=1)\n",
"primekg_ibd_edges_df[\"tail_id\"] = primekg_ibd_edges_df.apply(lambda x: f\"{x.tail_name}_({x.tail_id})\", axis=1)\n",
"primekg_ibd_edges_df.reset_index(drop=True, inplace=True)\n",
"primekg_ibd_edges_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# Convert the dataframes to a knowledge graph as a networkx object\n",
"kg = nx.DiGraph()\n",
"for i, row in primekg_ibd_nodes_df.iterrows():\n",
" kg.add_node(row['node_id'], **row.to_dict())\n",
"for i, row in primekg_ibd_edges_df.iterrows():\n",
" kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())\n"
]
},
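{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `nx.DiGraph` stores at most one edge per ordered (head, tail) pair, and that the `key=i` argument above is stored as an ordinary edge attribute named `key`, since only the multigraph classes treat `key` as an edge identifier. In this notebook the edge count of the graph matches the number of rows in the edge dataframe, so no edges are merged. The sketch below, on hypothetical toy data, illustrates the difference in case parallel edges ever need to be kept."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"\n",
"# Two hypothetical relations between the same pair of nodes\n",
"edges = [('A', 'B', 'associated with'), ('A', 'B', 'interacts with')]\n",
"\n",
"# DiGraph: the second add_edge overwrites the attributes of the first\n",
"simple = nx.DiGraph()\n",
"for i, (head, tail, rel) in enumerate(edges):\n",
"    simple.add_edge(head, tail, key=i, relation=rel)\n",
"\n",
"# MultiDiGraph: `key` distinguishes parallel edges between the same nodes\n",
"multi = nx.MultiDiGraph()\n",
"for i, (head, tail, rel) in enumerate(edges):\n",
"    multi.add_edge(head, tail, key=i, relation=rel)\n",
"\n",
"assert simple.number_of_edges() == 1\n",
"assert simple['A']['B']['relation'] == 'interacts with'\n",
"assert multi.number_of_edges() == 2"
]
},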
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"# Save graph object\n",
"local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'\n",
"with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'wb') as f:\n",
" pickle.dump(kg, f)\n",
"\n",
"# # Load graph object\n",
"# with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'rb') as f:\n",
"# kg = pickle.load(f)\n"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#Nodes 3426\n",
"#Edges 12752\n"
]
}
],
"source": [
"print(\"#Nodes\", kg.number_of_nodes())\n",
"print(\"#Edges\", kg.number_of_edges())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we can convert the networkx graph to a PyG `Data` object for further processing (e.g., subgraph extraction)."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# Convert networkx graph to PyG data object\n",
"pyg_graph = from_networkx(kg)\n",
"\n",
"# Save graph object\n",
"with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'wb') as f:\n",
" pickle.dump(pyg_graph, f)\n",
"\n",
"# Load graph object\n",
"# with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:\n",
"# pyg_graph = pickle.load(f)"
]
},
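{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what `from_networkx` produces, the sketch below converts a hypothetical two-node toy graph. It assumes the usual behaviour of `from_networkx`: attributes that can be converted to tensors (such as the numeric `x` and `edge_attr` lists) become tensors, while string attributes (such as `node_id`) are kept as plain Python lists. The names `toy` and `toy_data` are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"from torch_geometric.utils import from_networkx\n",
"\n",
"# Hypothetical toy graph with the same kinds of attributes as the IBD graph\n",
"toy = nx.DiGraph()\n",
"toy.add_node('a', node_id='a', x=[0.1, 0.2])\n",
"toy.add_node('b', node_id='b', x=[0.3, 0.4])\n",
"toy.add_edge('a', 'b', head_id='a', tail_id='b', edge_attr=[1.0, 0.0])\n",
"\n",
"toy_data = from_networkx(toy)\n",
"\n",
"# Inspect the result: numeric features, connectivity, and string identifiers\n",
"print(toy_data.x.shape)\n",
"print(toy_data.edge_index)\n",
"print(toy_data.node_id)"
]
},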
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we prepare a textualized graph, i.e., tables of nodes and edges in text form, which can be used in a RAG application, for instance.\n"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_id</th>\n",
" <th>node_attr</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>SMAD3 belongs to gene/protein category. The SM...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>IL10RB_(179)</td>\n",
" <td>IL10RB belongs to gene/protein category. The p...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GNA12_(192)</td>\n",
" <td>GNA12 belongs to gene/protein category. Predic...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>HNF4A_(279)</td>\n",
" <td>HNF4A belongs to gene/protein category. The pr...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>VCAM1_(417)</td>\n",
" <td>VCAM1 belongs to gene/protein category. This g...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3421</th>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3422</th>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3423</th>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3424</th>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3425</th>\n",
" <td>Kinesins_(129367)</td>\n",
" <td>Kinesins belongs to pathway category. This pat...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3426 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" node_id \\\n",
"0 SMAD3_(144) \n",
"1 IL10RB_(179) \n",
"2 GNA12_(192) \n",
"3 HNF4A_(279) \n",
"4 VCAM1_(417) \n",
"... ... \n",
"3421 IRAK2 mediated activation of TAK1 complex upon... \n",
"3422 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"3423 Antigen processing: Ubiquitination & Proteasom... \n",
"3424 Antigen Presentation: Folding, assembly and pe... \n",
"3425 Kinesins_(129367) \n",
"\n",
" node_attr \n",
"0 SMAD3 belongs to gene/protein category. The SM... \n",
"1 IL10RB belongs to gene/protein category. The p... \n",
"2 GNA12 belongs to gene/protein category. Predic... \n",
"3 HNF4A belongs to gene/protein category. The pr... \n",
"4 VCAM1 belongs to gene/protein category. This g... \n",
"... ... \n",
"3421 IRAK2 mediated activation of TAK1 complex upon... \n",
"3422 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"3423 Antigen processing: Ubiquitination & Proteasom... \n",
"3424 Antigen Presentation: Folding, assembly and pe... \n",
"3425 Kinesins belongs to pathway category. This pat... \n",
"\n",
"[3426 rows x 2 columns]"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Prepare nodes\n",
"nodes_df = pd.DataFrame({\n",
" 'node_id': list(pyg_graph.node_id),\n",
" 'node_attr': list(pyg_graph.enriched_node),\n",
"})\n",
"nodes_df"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_id</th>\n",
" <th>edge_type</th>\n",
" <th>tail_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>(gene/protein, associated with, disease)</td>\n",
" <td>Crohn disease_(37784)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>(gene/protein, associated with, disease)</td>\n",
" <td>inflammatory bowel disease_(28158)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>(gene/protein, associated with, disease)</td>\n",
" <td>Crohn's colitis_(83770)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>(gene/protein, associated with, disease)</td>\n",
" <td>Crohn ileitis and jejunitis_(35814)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>SMAD3_(144)</td>\n",
" <td>(gene/protein, interacts with, pathway)</td>\n",
" <td>Signaling by NODAL_(62373)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12747</th>\n",
" <td>IRAK2 mediated activation of TAK1 complex upon...</td>\n",
" <td>(pathway, interacts with, gene/protein)</td>\n",
" <td>TLR4_(3259)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12748</th>\n",
" <td>TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...</td>\n",
" <td>(pathway, interacts with, gene/protein)</td>\n",
" <td>TLR9_(10113)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12749</th>\n",
" <td>Antigen processing: Ubiquitination & Proteasom...</td>\n",
" <td>(pathway, interacts with, gene/protein)</td>\n",
" <td>HERC2_(1777)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12750</th>\n",
" <td>Antigen Presentation: Folding, assembly and pe...</td>\n",
" <td>(pathway, interacts with, gene/protein)</td>\n",
" <td>ERAP2_(12763)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12751</th>\n",
" <td>Kinesins_(129367)</td>\n",
" <td>(pathway, interacts with, gene/protein)</td>\n",
" <td>KIF21B_(8564)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>12752 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" head_id \\\n",
"0 SMAD3_(144) \n",
"1 SMAD3_(144) \n",
"2 SMAD3_(144) \n",
"3 SMAD3_(144) \n",
"4 SMAD3_(144) \n",
"... ... \n",
"12747 IRAK2 mediated activation of TAK1 complex upon... \n",
"12748 TRAF6 mediated IRF7 activation in TLR7/8 or 9 ... \n",
"12749 Antigen processing: Ubiquitination & Proteasom... \n",
"12750 Antigen Presentation: Folding, assembly and pe... \n",
"12751 Kinesins_(129367) \n",
"\n",
" edge_type \\\n",
"0 (gene/protein, associated with, disease) \n",
"1 (gene/protein, associated with, disease) \n",
"2 (gene/protein, associated with, disease) \n",
"3 (gene/protein, associated with, disease) \n",
"4 (gene/protein, interacts with, pathway) \n",
"... ... \n",
"12747 (pathway, interacts with, gene/protein) \n",
"12748 (pathway, interacts with, gene/protein) \n",
"12749 (pathway, interacts with, gene/protein) \n",
"12750 (pathway, interacts with, gene/protein) \n",
"12751 (pathway, interacts with, gene/protein) \n",
"\n",
" tail_id \n",
"0 Crohn disease_(37784) \n",
"1 inflammatory bowel disease_(28158) \n",
"2 Crohn's colitis_(83770) \n",
"3 Crohn ileitis and jejunitis_(35814) \n",
"4 Signaling by NODAL_(62373) \n",
"... ... \n",
"12747 TLR4_(3259) \n",
"12748 TLR9_(10113) \n",
"12749 HERC2_(1777) \n",
"12750 ERAP2_(12763) \n",
"12751 KIF21B_(8564) \n",
"\n",
"[12752 rows x 3 columns]"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "# Build the edge table (head, relation type, tail) from the edge attributes stored on the PyG graph\n",
"edges_df = pd.DataFrame({\n",
" 'head_id': list(pyg_graph.head_id),\n",
" 'edge_type': list(pyg_graph.edge_type),\n",
" 'tail_id': list(pyg_graph.tail_id),\n",
"})\n",
"edges_df"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
    "# Save the node and edge tables together in a single pickle file\n",
    "with open(os.path.join(local_dir, 'primekg_ibd_text_graph.pkl'), \"wb\") as f:\n",
    "    pickle.dump({\"nodes\": nodes_df, \"edges\": edges_df}, f)"
]
  }
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}