{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PrimeKG Loader"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial shows how to load the PrimeKG dataframes, which hold the entities (nodes) and relations (edges) of the knowledge graph.\n",
"\n",
"Background information on PrimeKG can be found in the following repositories:\n",
"- https://github.com/mims-harvard/PrimeKG\n",
"- https://github.com/mims-harvard/TDC/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:\n",
"\n",
"https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM\n",
"\n",
"At the time of writing, the latest version of PrimeKG (`kg.csv`) is `2.1`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we import the necessary libraries:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import sys\n",
"sys.path.append('../../..')\n",
"from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load PrimeKG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `PrimeKG` class downloads the data from the Harvard Dataverse server if it is not available locally.\n",
"\n",
"Otherwise, the data is loaded from the local directory given by `local_dir`."
]
},
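{
"cell_type": "markdown",
"metadata": {},
"source": [
"The load-locally-or-download behaviour described above can be sketched in isolation. This is a minimal illustration only, not the actual `PrimeKG` implementation: `load_or_fetch`, `fake_fetch`, and the file name are hypothetical, and the download is replaced by a stand-in function so that the example runs offline.\n",
"\n",
"```python\n",
"import tempfile\n",
"from pathlib import Path\n",
"from typing import Callable\n",
"\n",
"\n",
"def load_or_fetch(local_path: Path, fetch: Callable[[], bytes]) -> bytes:\n",
"    # Return the file contents, calling `fetch` only if the file is missing.\n",
"    if local_path.exists():\n",
"        # Cache hit: read the local copy and skip the download.\n",
"        return local_path.read_bytes()\n",
"    # Cache miss: fetch the data, store it locally, then return it.\n",
"    local_path.parent.mkdir(parents=True, exist_ok=True)\n",
"    data = fetch()\n",
"    local_path.write_bytes(data)\n",
"    return data\n",
"\n",
"\n",
"calls = []\n",
"\n",
"\n",
"def fake_fetch() -> bytes:\n",
"    # Hypothetical stand-in for an HTTP download; records each invocation.\n",
"    calls.append(1)\n",
"    return b'node_index,node_name'\n",
"\n",
"\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    target = Path(tmp) / 'primekg' / 'nodes.csv'\n",
"    first = load_or_fetch(target, fake_fetch)   # fetches and caches\n",
"    second = load_or_fetch(target, fake_fetch)  # served from disk\n",
"```\n",
"\n",
"Here the second call finds the cached file, so `fake_fetch` runs only once."
]
},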
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Define primekg data by providing a local directory where the data is stored\n",
"primekg_data = PrimeKG(local_dir=\"../../../../data/primekg/\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To load the node and edge dataframes, we call `load_data()` and then retrieve them with `get_nodes()` and `get_edges()`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading node file from https://dataverse.harvard.edu/api/access/datafile/6180617\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 8.89M/8.89M [00:01<00:00, 8.06MiB/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading edge file from https://dataverse.harvard.edu/api/access/datafile/6180616\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 387M/387M [00:35<00:00, 10.8MiB/s] \n"
]
}
],
"source": [
"# Invoke a method to load the data\n",
"primekg_data.load_data()\n",
"\n",
"# Get primekg_nodes and primekg_edges\n",
"primekg_nodes = primekg_data.get_nodes()\n",
"primekg_edges = primekg_data.get_edges()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check PrimeKG Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, `primekg_nodes` and `primekg_edges` are the dataframes of nodes and edges, respectively.\n",
"\n",
"We can further analyze the dataframes to extract the information we need.\n",
"\n",
"For instance, we can construct a graph from the nodes and edges dataframes using the networkx library."
]
},
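{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the graph construction mentioned above, the snippet below builds a directed multigraph with `networkx.from_pandas_edgelist`. It assumes `networkx` and `pandas` are installed and uses a small toy dataframe (the rows are illustrative, not real PrimeKG data) with the `head_index`, `tail_index`, and `display_relation` columns of the edges dataframe. Building a graph from all PrimeKG edges may need considerably more memory and time.\n",
"\n",
"```python\n",
"import networkx as nx\n",
"import pandas as pd\n",
"\n",
"# Toy edges that only mimic the column names of the edges dataframe.\n",
"edges = pd.DataFrame({\n",
"    'head_index': [0, 1, 0],\n",
"    'tail_index': [8889, 2798, 2798],\n",
"    'display_relation': ['ppi', 'ppi', 'ppi'],\n",
"})\n",
"\n",
"# A MultiDiGraph keeps edge direction and allows parallel edges.\n",
"graph = nx.from_pandas_edgelist(\n",
"    edges,\n",
"    source='head_index',\n",
"    target='tail_index',\n",
"    edge_attr='display_relation',\n",
"    create_using=nx.MultiDiGraph,\n",
")\n",
"\n",
"n_nodes = graph.number_of_nodes()\n",
"n_edges = graph.number_of_edges()\n",
"```\n",
"\n",
"The same call could be applied to `primekg_edges` (or a filtered subset of it) once it is loaded."
]
},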
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### PrimeKG Nodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`primekg_nodes` is a dataframe of nodes, which has the following columns:\n",
"- `node_index`: the index of the node\n",
"- `node_name`: the name of the node\n",
"- `node_source`: the source database of the node\n",
"- `node_id`: the id of the node in the source database\n",
"- `node_type`: the type of the node\n",
"\n",
"We can inspect the first few rows of `primekg_nodes` as follows."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>node_index</th>\n",
" <th>node_name</th>\n",
" <th>node_source</th>\n",
" <th>node_id</th>\n",
" <th>node_type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>PHYHIP</td>\n",
" <td>NCBI</td>\n",
" <td>9796</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>GPANK1</td>\n",
" <td>NCBI</td>\n",
" <td>7918</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>ZRSR2</td>\n",
" <td>NCBI</td>\n",
" <td>8233</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>NRF1</td>\n",
" <td>NCBI</td>\n",
" <td>4899</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>PI4KA</td>\n",
" <td>NCBI</td>\n",
" <td>5297</td>\n",
" <td>gene/protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node_index node_name node_source node_id node_type\n",
"0 0 PHYHIP NCBI 9796 gene/protein\n",
"1 1 GPANK1 NCBI 7918 gene/protein\n",
"2 2 ZRSR2 NCBI 8233 gene/protein\n",
"3 3 NRF1 NCBI 4899 gene/protein\n",
"4 4 PI4KA NCBI 5297 gene/protein"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check a sample of the primekg nodes\n",
"primekg_nodes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The current version of PrimeKG has about 130K nodes in total, as shown in the following cell."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(129375, 5)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check dimensions of the primekg nodes\n",
"primekg_nodes.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can break down the primekg nodes by type as follows."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"node_type\n",
"biological_process 28642\n",
"gene/protein 27671\n",
"disease 17080\n",
"effect/phenotype 15311\n",
"anatomy 14035\n",
"molecular_function 11169\n",
"drug 7957\n",
"cellular_component 4176\n",
"pathway 2516\n",
"exposure 818\n",
"Name: count, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show node types and their counts\n",
"primekg_nodes['node_type'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PrimeKG was built from various sources, as the number of nodes per source shows."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"node_source\n",
"GO 43987\n",
"NCBI 27671\n",
"MONDO 15813\n",
"HPO 15311\n",
"UBERON 14035\n",
"DrugBank 7957\n",
"REACTOME 2516\n",
"MONDO_grouped 1267\n",
"CTD 818\n",
"Name: count, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show source of the primekg nodes\n",
"primekg_nodes['node_source'].value_counts()"
]
},
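{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts above are consistent with a single source contributing several node types: the `GO` total (43987) equals the combined counts of `biological_process`, `molecular_function`, and `cellular_component`. One way to inspect this relationship is a cross-tabulation. The sketch below uses `pandas.crosstab` on a toy dataframe (the rows are illustrative, not real PrimeKG data) with the `node_source` and `node_type` columns.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy nodes that only mimic the column names of the nodes dataframe.\n",
"nodes = pd.DataFrame({\n",
"    'node_source': ['GO', 'GO', 'GO', 'NCBI', 'DrugBank'],\n",
"    'node_type': [\n",
"        'biological_process',\n",
"        'molecular_function',\n",
"        'cellular_component',\n",
"        'gene/protein',\n",
"        'drug',\n",
"    ],\n",
"})\n",
"\n",
"# Rows are sources, columns are node types, cells are node counts.\n",
"table = pd.crosstab(nodes['node_source'], nodes['node_type'])\n",
"\n",
"# Number of distinct node types contributed by the GO source.\n",
"go_types = int((table.loc['GO'] > 0).sum())\n",
"```\n",
"\n",
"Replacing `nodes` with `primekg_nodes` should give the full source-by-type table."
]
},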
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### PrimeKG Edges"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`primekg_edges` is a dataframe of edges, which has the following columns:\n",
"- `head_index`: the index of the head node\n",
"- `head_name`: the name of the head node\n",
"- `head_source`: the source database of the head node\n",
"- `head_id`: the id of the head node in the source database\n",
"- `head_type`: the type of the head node\n",
"- `tail_index`: the index of the tail node\n",
"- `tail_name`: the name of the tail node\n",
"- `tail_source`: the source database of the tail node\n",
"- `tail_id`: the id of the tail node in the source database\n",
"- `tail_type`: the type of the tail node\n",
"- `display_relation`: the display name of the edge type (e.g., `ppi`)\n",
"- `relation`: the type of the edge (e.g., `protein_protein`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also inspect the first few rows of `primekg_edges` to see how the nodes are connected."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>head_index</th>\n",
" <th>head_name</th>\n",
" <th>head_source</th>\n",
" <th>head_id</th>\n",
" <th>head_type</th>\n",
" <th>tail_index</th>\n",
" <th>tail_name</th>\n",
" <th>tail_source</th>\n",
" <th>tail_id</th>\n",
" <th>tail_type</th>\n",
" <th>display_relation</th>\n",
" <th>relation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>PHYHIP</td>\n",
" <td>NCBI</td>\n",
" <td>9796</td>\n",
" <td>gene/protein</td>\n",
" <td>8889</td>\n",
" <td>KIF15</td>\n",
" <td>NCBI</td>\n",
" <td>56992</td>\n",
" <td>gene/protein</td>\n",
" <td>ppi</td>\n",
" <td>protein_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>GPANK1</td>\n",
" <td>NCBI</td>\n",
" <td>7918</td>\n",
" <td>gene/protein</td>\n",
" <td>2798</td>\n",
" <td>PNMA1</td>\n",
" <td>NCBI</td>\n",
" <td>9240</td>\n",
" <td>gene/protein</td>\n",
" <td>ppi</td>\n",
" <td>protein_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>ZRSR2</td>\n",
" <td>NCBI</td>\n",
" <td>8233</td>\n",
" <td>gene/protein</td>\n",
" <td>5646</td>\n",
" <td>TTC33</td>\n",
" <td>NCBI</td>\n",
" <td>23548</td>\n",
" <td>gene/protein</td>\n",
" <td>ppi</td>\n",
" <td>protein_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>NRF1</td>\n",
" <td>NCBI</td>\n",
" <td>4899</td>\n",
" <td>gene/protein</td>\n",
" <td>11592</td>\n",
" <td>MAN1B1</td>\n",
" <td>NCBI</td>\n",
" <td>11253</td>\n",
" <td>gene/protein</td>\n",
" <td>ppi</td>\n",
" <td>protein_protein</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>PI4KA</td>\n",
" <td>NCBI</td>\n",
" <td>5297</td>\n",
" <td>gene/protein</td>\n",
" <td>2122</td>\n",
" <td>RGS20</td>\n",
" <td>NCBI</td>\n",
" <td>8601</td>\n",
" <td>gene/protein</td>\n",
" <td>ppi</td>\n",
" <td>protein_protein</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" head_index head_name head_source head_id head_type tail_index \\\n",
"0 0 PHYHIP NCBI 9796 gene/protein 8889 \n",
"1 1 GPANK1 NCBI 7918 gene/protein 2798 \n",
"2 2 ZRSR2 NCBI 8233 gene/protein 5646 \n",
"3 3 NRF1 NCBI 4899 gene/protein 11592 \n",
"4 4 PI4KA NCBI 5297 gene/protein 2122 \n",
"\n",
" tail_name tail_source tail_id tail_type display_relation \\\n",
"0 KIF15 NCBI 56992 gene/protein ppi \n",
"1 PNMA1 NCBI 9240 gene/protein ppi \n",
"2 TTC33 NCBI 23548 gene/protein ppi \n",
"3 MAN1B1 NCBI 11253 gene/protein ppi \n",
"4 RGS20 NCBI 8601 gene/protein ppi \n",
"\n",
" relation \n",
"0 protein_protein \n",
"1 protein_protein \n",
"2 protein_protein \n",
"3 protein_protein \n",
"4 protein_protein "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check a sample of the primekg edges\n",
"primekg_edges.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The current version of PrimeKG has about 8.1M edges in total, as shown in the following cell."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8100498, 12)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check dimensions of the primekg edges\n",
"primekg_edges.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can break down the primekg edges by type as follows."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"display_relation\n",
"expression present 3036406\n",
"synergistic interaction 2672628\n",
"interacts with 686550\n",
"ppi 642150\n",
"phenotype present 300634\n",
"parent-child 281744\n",
"associated with 167482\n",
"side effect 129568\n",
"contraindication 61350\n",
"expression absent 39774\n",
"target 32760\n",
"indication 18776\n",
"enzyme 10634\n",
"transporter 6184\n",
"off-label use 5136\n",
"linked to 4608\n",
"phenotype absent 2386\n",
"carrier 1728\n",
"Name: count, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show edge types and their counts\n",
"primekg_edges['display_relation'].value_counts()"
]
}
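,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the edge types are known, a common next step is to select the edges of a single type. The sketch below filters a toy dataframe (the rows and names are made up for illustration, not real PrimeKG data) on the `display_relation` column and lists the resulting head and tail pairs.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy edges that only mimic the column names of the edges dataframe.\n",
"edges = pd.DataFrame({\n",
"    'head_name': ['gene_A', 'gene_B', 'drug_X'],\n",
"    'tail_name': ['gene_C', 'gene_D', 'gene_A'],\n",
"    'display_relation': ['ppi', 'ppi', 'target'],\n",
"})\n",
"\n",
"# Keep only the rows whose edge type is 'ppi'.\n",
"ppi_edges = edges[edges['display_relation'] == 'ppi']\n",
"\n",
"# Collect (head, tail) name pairs for the selected edges.\n",
"ppi_pairs = list(zip(ppi_edges['head_name'], ppi_edges['tail_name']))\n",
"```\n",
"\n",
"The same boolean mask could be applied to `primekg_edges` with any value from the counts above."
]
}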
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}