{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic API Usage of KGWAS\n", "\n", "KGWAS consists of two main classes, `KGWAS` and `KGWAS_Data`. `KGWAS` is the main class for the KGWAS model, and `KGWAS_Data` handles data manipulation. By default, to keep things fast, KGWAS runs in a fast mode that uses Enformer embeddings for variant features and ESM embeddings for gene features (instead of baselineLD for variants and PoPS for genes, which are large files). In fast mode you do not need to download any data yourself; the KGWAS API automatically downloads the relevant files. This mode can be used to apply KGWAS to your own GWAS summary statistics. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All required data files are present.\n", "--loading KG---\n", "--using enformer SNP embedding--\n", "--using random go embedding--\n", "--using ESM gene embedding--\n" ] } ], "source": [ "import sys\n", "sys.path.append('../')\n", "\n", "from kgwas import KGWAS, KGWAS_Data\n", "data = KGWAS_Data(data_path = './data/')\n", "data.load_kg()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the data needed for training has been downloaded from the server and the knowledge graph is loaded. Next, we load the GWAS file. Here we use an example GWAS file, which is also downloaded automatically from the server, but you can also use your own. The GWAS file should be in the format of a pandas DataFrame with columns `CHR`/`#CHROM`, `SNP`, `P`, and `N`. Note that the knowledge graph currently covers only the UK Biobank directly genotyped variant set, so KGWAS automatically takes the overlap between the variants in your GWAS and those in the KG. Efforts are underway to improve the coverage of the KG." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading example GWAS file...\n", "Example file already exists locally.\n", "Loading GWAS file from ./data/biochemistry_Creatinine_fastgwa_full_10000_1.fastGWA...\n", "Number of SNPs in the KG: 784256\n", "Number of SNPs in the GWAS: 542758\n", "Number of SNPs in the KG variant set: 542758\n", "Using ldsc weight...\n", "ldsc_weight mean: 0.9999999999999993\n" ] } ], "source": [ "data.load_external_gwas(example_file = True)\n", "data.process_gwas_file()\n", "data.prepare_split()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | #CHROM | ID | POS | A1 | A2 | N | AF1 | BETA | SE | P | ld_score | w_ld_score | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | rs3131962 | 756604 | A | G | 9988 | 0.131007 | -0.117134 | 0.246231 | 0.634282 | 72.862240 | 4.474788 | 0.226298 |
| 1 | 1 | rs12562034 | 768448 | A | G | 9978 | 0.104981 | -0.064894 | 0.273746 | 0.812611 | 34.749233 | 1.877341 | 0.056197 |
| 2 | 1 | rs4040617 | 779322 | G | A | 9975 | 0.129123 | -0.001462 | 0.247254 | 0.995281 | 72.271390 | 4.208873 | 0.000035 |
| 3 | 1 | rs79373928 | 801536 | G | T | 9994 | 0.014659 | 0.081544 | 0.688261 | 0.905688 | 16.740126 | 1.949177 | 0.014037 |
| 4 | 1 | rs11240779 | 808631 | G | A | 9919 | 0.226737 | -0.184268 | 0.198982 | 0.354418 | 50.215000 | 2.825456 | 0.857575 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | rs73174435 | 51174939 | T | C | 9979 | 0.056118 | -0.158762 | 0.362390 | 0.661316 | 21.981667 | 1.363001 | 0.191929 |
| 542754 | 22 | rs3810648 | 51175626 | G | A | 9931 | 0.058856 | 0.272493 | 0.352508 | 0.439515 | 34.619377 | 1.804193 | 0.597548 |
| 542755 | 22 | rs5771002 | 51183255 | A | G | 9840 | 0.333638 | 0.116325 | 0.175675 | 0.507869 | 16.231083 | 1.273770 | 0.438456 |
| 542756 | 22 | rs3865764 | 51185848 | G | A | 9974 | 0.051133 | -0.026670 | 0.376132 | 0.943472 | 18.649513 | 1.010000 | 0.005028 |
| 542757 | 22 | rs142680588 | 51193629 | G | A | 9981 | 0.076595 | -0.109532 | 0.312971 | 0.726358 | 52.471287 | 1.873861 | 0.122482 |

542758 rows × 13 columns

| | #CHROM | ID | POS | A1 | A2 | N | AF1 | BETA | SE | P | ld_score | w_ld_score | y | pred | P_weighted | KGWAS_P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | rs3131962 | 756604 | A | G | 9988 | 0.131007 | -0.117134 | 0.246231 | 0.634282 | 72.862240 | 4.474788 | 0.226298 | 1.082365 | 0.234167 | 0.346428 |
| 1 | 1 | rs12562034 | 768448 | A | G | 9978 | 0.104981 | -0.064894 | 0.273746 | 0.812611 | 34.749233 | 1.877341 | 0.056197 | 1.087724 | 0.382894 | 0.566456 |
| 2 | 1 | rs4040617 | 779322 | G | A | 9975 | 0.129123 | -0.001462 | 0.247254 | 0.995281 | 72.271390 | 4.208873 | 0.000035 | 1.058530 | 0.995281 | 1 |
| 3 | 1 | rs79373928 | 801536 | G | T | 9994 | 0.014659 | 0.081544 | 0.688261 | 0.905688 | 16.740126 | 1.949177 | 0.014037 | 1.105125 | 0.225107 | 0.333025 |
| 4 | 1 | rs11240779 | 808631 | G | A | 9919 | 0.226737 | -0.184268 | 0.198982 | 0.354418 | 50.215000 | 2.825456 | 0.857575 | 1.081468 | 0.041646 | 0.061612 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | rs73174435 | 51174939 | T | C | 9979 | 0.056118 | -0.158762 | 0.362390 | 0.661316 | 21.981667 | 1.363001 | 0.191929 | 1.008835 | 0.233609 | 0.345602 |
| 542754 | 22 | rs3810648 | 51175626 | G | A | 9931 | 0.058856 | 0.272493 | 0.352508 | 0.439515 | 34.619377 | 1.804193 | 0.597548 | 1.034187 | 0.439515 | 0.650221 |
| 542755 | 22 | rs5771002 | 51183255 | A | G | 9840 | 0.333638 | 0.116325 | 0.175675 | 0.507869 | 16.231083 | 1.273770 | 0.438456 | 1.093221 | 0.449038 | 0.66431 |
| 542756 | 22 | rs3865764 | 51185848 | G | A | 9974 | 0.051133 | -0.026670 | 0.376132 | 0.943472 | 18.649513 | 1.010000 | 0.005028 | 0.987747 | 0.943472 | 1 |
| 542757 | 22 | rs142680588 | 51193629 | G | A | 9981 | 0.076595 | -0.109532 | 0.312971 | 0.726358 | 52.471287 | 1.873861 | 0.122482 | 1.082649 | 0.26816 | 0.396718 |

542758 rows × 16 columns

| | #CHROM | ID | POS | A1 | A2 | N | AF1 | BETA | SE | P |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | rs3131962 | 756604 | A | G | 4993 | 0.129882 | 14.559400 | 17.1871 | 0.396933 |
| 1 | 1 | rs12562034 | 768448 | A | G | 4994 | 0.103124 | -15.034400 | 19.0234 | 0.429345 |
| 2 | 1 | rs4040617 | 779322 | G | A | 4979 | 0.127435 | 15.537200 | 17.3933 | 0.371704 |
| 3 | 1 | rs79373928 | 801536 | G | T | 4996 | 0.015012 | 16.142600 | 47.7752 | 0.735448 |
| 4 | 1 | rs11240779 | 808631 | G | A | 4961 | 0.222233 | 0.859838 | 13.9158 | 0.950731 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | rs73174435 | 51174939 | T | C | 4991 | 0.057103 | 53.082400 | 24.8130 | 0.032412 |
| 542754 | 22 | rs3810648 | 51175626 | G | A | 4959 | 0.066243 | 17.689800 | 23.2562 | 0.446867 |
| 542755 | 22 | rs5771002 | 51183255 | A | G | 4937 | 0.334414 | -12.170400 | 12.3314 | 0.323670 |
| 542756 | 22 | rs3865764 | 51185848 | G | A | 4984 | 0.050662 | -43.871900 | 26.3007 | 0.095299 |
| 542757 | 22 | rs142680588 | 51193629 | G | A | 4994 | 0.073388 | 11.338700 | 22.2066 | 0.609630 |

542758 rows × 10 columns

| | #CHROM | ID | POS | A1 | A2 | N | AF1 | BETA | SE | P |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | rs3131962 | 756604 | A | G | 4993 | 0.129882 | -2.960260 | 7.66276 | 0.699261 |
| 1 | 1 | rs12562034 | 768448 | A | G | 4994 | 0.103124 | -19.335700 | 8.47710 | 0.022552 |
| 2 | 1 | rs4040617 | 779322 | G | A | 4979 | 0.127435 | -3.287600 | 7.75475 | 0.671605 |
| 3 | 1 | rs79373928 | 801536 | G | T | 4996 | 0.015012 | -12.530000 | 21.29860 | 0.556329 |
| 4 | 1 | rs11240779 | 808631 | G | A | 4961 | 0.222233 | -8.564830 | 6.20273 | 0.167335 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | rs73174435 | 51174939 | T | C | 4991 | 0.057103 | -24.859400 | 11.06160 | 0.024617 |
| 542754 | 22 | rs3810648 | 51175626 | G | A | 4959 | 0.066243 | -0.725793 | 10.36870 | 0.944195 |
| 542755 | 22 | rs5771002 | 51183255 | A | G | 4937 | 0.334414 | -5.555300 | 5.49753 | 0.312251 |
| 542756 | 22 | rs3865764 | 51185848 | G | A | 4984 | 0.050662 | 12.588200 | 11.72730 | 0.283085 |
| 542757 | 22 | rs142680588 | 51193629 | G | A | 4994 | 0.073388 | -13.533700 | 9.89851 | 0.171548 |

542758 rows × 10 columns

| | #CHROM | POS | ID | REF | ALT | A1 | FIRTH? | TEST | OBS_CT | OR | LOG(OR)_SE | Z_STAT | P | ERRCODE | SNP | A2 | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 756604 | rs3131962 | G | A | A | Y | ADD | 999 | 1.241130 | 0.209870 | 1.029320 | 0.303330 | . | rs3131962 | G | 999 |
| 1 | 1 | 768448 | rs12562034 | G | A | A | Y | ADD | 996 | 0.433894 | 0.285912 | -2.920330 | 0.003497 | . | rs12562034 | G | 996 |
| 2 | 1 | 779322 | rs4040617 | A | G | G | Y | ADD | 996 | 1.178310 | 0.211892 | 0.774379 | 0.438707 | . | rs4040617 | A | 996 |
| 3 | 1 | 801536 | rs79373928 | T | G | G | Y | ADD | 998 | 0.989852 | 0.479159 | -0.021286 | 0.983018 | . | rs79373928 | T | 998 |
| 4 | 1 | 808631 | rs11240779 | A | G | G | Y | ADD | 994 | 0.880382 | 0.173114 | -0.735930 | 0.461773 | . | rs11240779 | A | 994 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | 51174939 | rs73174435 | C | T | T | Y | ADD | 999 | 0.642727 | 0.362564 | -1.219190 | 0.222772 | . | rs73174435 | C | 999 |
| 542754 | 22 | 51175626 | rs3810648 | A | G | G | Y | ADD | 996 | 0.752885 | 0.286799 | -0.989690 | 0.322326 | . | rs3810648 | A | 996 |
| 542755 | 22 | 51183255 | rs5771002 | G | A | A | Y | ADD | 981 | 0.792577 | 0.150356 | -1.546100 | 0.122080 | . | rs5771002 | G | 981 |
| 542756 | 22 | 51185848 | rs3865764 | A | G | G | Y | ADD | 996 | 1.004930 | 0.386700 | 0.012715 | 0.989855 | . | rs3865764 | A | 996 |
| 542757 | 22 | 51193629 | rs142680588 | A | G | G | Y | ADD | 1000 | 1.497360 | 0.267489 | 1.509230 | 0.131240 | . | rs142680588 | A | 1000 |

542758 rows × 17 columns

| | #CHROM | ID | POS | A1 | A2 | N | AF1 | BETA | SE | P |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | rs3131962 | 756604 | A | G | 407023 | 0.129655 | 0.000286 | 0.001048 | 0.784760 |
| 1 | 1 | rs12562034 | 768448 | A | G | 407057 | 0.104966 | -0.001491 | 0.001147 | 0.193592 |
| 2 | 1 | rs4040617 | 779322 | G | A | 406623 | 0.127520 | 0.000108 | 0.001056 | 0.918404 |
| 3 | 1 | rs79373928 | 801536 | G | T | 407517 | 0.014884 | 0.004382 | 0.002904 | 0.131349 |
| 4 | 1 | rs11240779 | 808631 | G | A | 404493 | 0.224886 | -0.001155 | 0.000846 | 0.172345 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 542753 | 22 | rs73174435 | 51174939 | T | C | 407201 | 0.053846 | -0.001980 | 0.001559 | 0.203959 |
| 542754 | 22 | rs3810648 | 51175626 | G | A | 404901 | 0.060979 | 0.001922 | 0.001474 | 0.192116 |
| 542755 | 22 | rs5771002 | 51183255 | A | G | 401398 | 0.333603 | -0.000165 | 0.000751 | 0.826494 |
| 542756 | 22 | rs3865764 | 51185848 | G | A | 406611 | 0.050601 | -0.001311 | 0.001605 | 0.413994 |
| 542757 | 22 | rs142680588 | 51193629 | G | A | 407108 | 0.075912 | -0.002861 | 0.001329 | 0.031362 |

542758 rows × 10 columns
\n", "