{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic API Usage of KGWAS\n", "\n", "KGWAS consists of two main classes, `KGWAS` and `KGWAS_Data`. `KGWAS` is the main class for the KGWAS model, and `KGWAS_Data` is the class for data manipulation. By default, to keep things fast, KGWAS runs in a fast mode that uses Enformer embeddings for variant features and ESM embeddings for gene features (instead of baselineLD for variants and PoPS for genes, which are large files). In fast mode you do not need to download any data yourself; the KGWAS API automatically downloads the relevant files. This mode can be used to apply KGWAS to your own GWAS sumstats. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All required data files are present.\n", "--loading KG---\n", "--using enformer SNP embedding--\n", "--using random go embedding--\n", "--using ESM gene embedding--\n" ] } ], "source": [ "import sys\n", "sys.path.append('../')\n", "\n", "from kgwas import KGWAS, KGWAS_Data\n", "data = KGWAS_Data(data_path = './data/')\n", "data.load_kg()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the data needed for training has been downloaded from the server and the knowledge graph is loaded. Next, we load the GWAS file. Here we use an example GWAS file, which is also downloaded automatically from the server, but you can also use your own GWAS file. The GWAS file should be a pandas DataFrame with the columns `CHR`/`#CHROM`, `SNP`, `P`, and `N`. Note that, at the moment, our knowledge graph covers the UK Biobank directly genotyped variant set, so the GWAS is automatically restricted to the variants that overlap with the KG. Efforts are underway to improve the coverage of the KG." 
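, "\n", "\n", "As a minimal sketch of the input format described above (plain pandas, not part of the KGWAS API), the hypothetical helper `check_gwas_columns` reports which expected columns are missing before you hand your own sumstats to KGWAS. The toy rows are illustrative values only.

```python
import pandas as pd

# Columns named in the text above; the chromosome column may be CHR or #CHROM.
REQUIRED = ['SNP', 'P', 'N']
CHR_ALIASES = ['CHR', '#CHROM']


def check_gwas_columns(df):
    '''Return the expected sumstats columns that are missing from df (hypothetical helper).'''
    missing = [col for col in REQUIRED if col not in df.columns]
    # Accept either spelling of the chromosome column.
    if not any(alias in df.columns for alias in CHR_ALIASES):
        missing.append('CHR')
    return missing


# Toy summary statistics for illustration only.
gwas = pd.DataFrame({
    'CHR': [1, 1, 1],
    'SNP': ['rs3131962', 'rs12562034', 'rs4040617'],
    'P': [0.634282, 0.812611, 0.995281],
    'N': [9988, 9978, 9975],
})

missing = check_gwas_columns(gwas)
print(missing)
```

An empty list means all four columns are present; `#CHROM` is accepted in place of `CHR`, matching the column naming used above."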
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading example GWAS file...\n", "Example file already exists locally.\n", "Loading GWAS file from ./data/biochemistry_Creatinine_fastgwa_full_10000_1.fastGWA...\n", "Number of SNPs in the KG: 784256\n", "Number of SNPs in the GWAS: 542758\n", "Number of SNPs in the KG variant set: 542758\n", "Using ldsc weight...\n", "ldsc_weight mean: 0.9999999999999993\n" ] } ], "source": [ "data.load_external_gwas(example_file = True)\n", "data.process_gwas_file()\n", "data.prepare_split()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM ID POS A1 A2 N AF1 BETA \\\n", "0 1 rs3131962 756604 A G 9988 0.131007 -0.117134 \n", "1 1 rs12562034 768448 A G 9978 0.104981 -0.064894 \n", "2 1 rs4040617 779322 G A 9975 0.129123 -0.001462 \n", "3 1 rs79373928 801536 G T 9994 0.014659 0.081544 \n", "4 1 rs11240779 808631 G A 9919 0.226737 -0.184268 \n", "... ... ... ... .. .. ... ... ... \n", "542753 22 rs73174435 51174939 T C 9979 0.056118 -0.158762 \n", "542754 22 rs3810648 51175626 G A 9931 0.058856 0.272493 \n", "542755 22 rs5771002 51183255 A G 9840 0.333638 0.116325 \n", "542756 22 rs3865764 51185848 G A 9974 0.051133 -0.026670 \n", "542757 22 rs142680588 51193629 G A 9981 0.076595 -0.109532 \n", "\n", " SE P ld_score w_ld_score y \n", "0 0.246231 0.634282 72.862240 4.474788 0.226298 \n", "1 0.273746 0.812611 34.749233 1.877341 0.056197 \n", "2 0.247254 0.995281 72.271390 4.208873 0.000035 \n", "3 0.688261 0.905688 16.740126 1.949177 0.014037 \n", "4 0.198982 0.354418 50.215000 2.825456 0.857575 \n", "... ... ... ... ... ... \n", "542753 0.362390 0.661316 21.981667 1.363001 0.191929 \n", "542754 0.352508 0.439515 34.619377 1.804193 0.597548 \n", "542755 0.175675 0.507869 16.231083 1.273770 0.438456 \n", "542756 0.376132 0.943472 18.649513 1.010000 0.005028 \n", "542757 0.312971 0.726358 52.471287 1.873861 0.122482 \n", "\n", "[542758 rows x 13 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.lr_uni" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are ready to train the model! Here we are using epoch = 1 for the demo purpose, but in reality, you should use a higher number of epochs for better performance." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Creating data loader...\n", "Start Training...\n", "Training Progress Epoch 1/1: 52%|█████▏ | 500/956 [12:56<15:47, 2.08s/it]Epoch 1 Step 501 Train Loss: 1.8115\n", "Training Progress Epoch 1/1: 100%|██████████| 956/956 [24:26<00:00, 1.53s/it]\n", "100%|██████████| 50/50 [00:58<00:00, 1.17s/it]\n", "Epoch 1: Validation MSE: 2.1730 Validation Pearson: 0.0096. \n", "Saving models to ./data//model/test\n", "100%|██████████| 54/54 [00:56<00:00, 1.04s/it]\n", "100%|██████████| 1061/1061 [05:40<00:00, 3.11it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "KGWAS prediction and p-values saved to ./data//model_pred/new_experiments/test_pred.csv\n" ] } ], "source": [ "run = KGWAS(data, device = 'cuda:9', exp_name = 'test')\n", "run.initialize_model()\n", "run.train(epoch = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of the model is saved to `/model_pred/new_experiments/{exp_name}_pred.csv`. You can also load it via `run.kgwas_res`. The model is also saved to `/model/{exp_name}`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM ID POS A1 A2 N AF1 BETA \\\n", "0 1 rs3131962 756604 A G 9988 0.131007 -0.117134 \n", "1 1 rs12562034 768448 A G 9978 0.104981 -0.064894 \n", "2 1 rs4040617 779322 G A 9975 0.129123 -0.001462 \n", "3 1 rs79373928 801536 G T 9994 0.014659 0.081544 \n", "4 1 rs11240779 808631 G A 9919 0.226737 -0.184268 \n", "... ... ... ... .. .. ... ... ... \n", "542753 22 rs73174435 51174939 T C 9979 0.056118 -0.158762 \n", "542754 22 rs3810648 51175626 G A 9931 0.058856 0.272493 \n", "542755 22 rs5771002 51183255 A G 9840 0.333638 0.116325 \n", "542756 22 rs3865764 51185848 G A 9974 0.051133 -0.026670 \n", "542757 22 rs142680588 51193629 G A 9981 0.076595 -0.109532 \n", "\n", " SE P ld_score w_ld_score y pred \\\n", "0 0.246231 0.634282 72.862240 4.474788 0.226298 1.082365 \n", "1 0.273746 0.812611 34.749233 1.877341 0.056197 1.087724 \n", "2 0.247254 0.995281 72.271390 4.208873 0.000035 1.058530 \n", "3 0.688261 0.905688 16.740126 1.949177 0.014037 1.105125 \n", "4 0.198982 0.354418 50.215000 2.825456 0.857575 1.081468 \n", "... ... ... ... ... ... ... \n", "542753 0.362390 0.661316 21.981667 1.363001 0.191929 1.008835 \n", "542754 0.352508 0.439515 34.619377 1.804193 0.597548 1.034187 \n", "542755 0.175675 0.507869 16.231083 1.273770 0.438456 1.093221 \n", "542756 0.376132 0.943472 18.649513 1.010000 0.005028 0.987747 \n", "542757 0.312971 0.726358 52.471287 1.873861 0.122482 1.082649 \n", "\n", " P_weighted KGWAS_P \n", "0 0.234167 0.346428 \n", "1 0.382894 0.566456 \n", "2 0.995281 1 \n", "3 0.225107 0.333025 \n", "4 0.041646 0.061612 \n", "... ... ... 
\n", "542753 0.233609 0.345602 \n", "542754 0.439515 0.650221 \n", "542755 0.449038 0.66431 \n", "542756 0.943472 1 \n", "542757 0.26816 0.396718 \n", "\n", "[542758 rows x 16 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "run.kgwas_res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If needed, you can load the pre-trained model via `run.load_pretrained()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run.load_pretrained('./data/model/test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to (1) use the full mode of KGWAS (i.e. larger node embeddings), (2) access the null/causal simulations, (3) access the 21 subsampled GWAS sumstats across various sample sizes, (4) analyze the KGWAS sumstats for subsampled data, or (5) analyze the KGWAS sumstats for all UKBB ICD10 diseases, please use [this link](https://drive.google.com/file/d/14UcHzPRIbdMmnLPZCHx_4G-gz2pipeg9/view?usp=sharing). Note that this file is large (around 45GB) and may take a while to download. After unzipping it, you can use that directory as the data directory for the KGWAS API." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All required data files are present.\n" ] } ], "source": [ "from kgwas import KGWAS, KGWAS_Data\n", "data = KGWAS_Data(data_path = '/dfs/project/datasets/20220524-ukbiobank/data/kgwas_data/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can use various variant, gene, and program embeddings. For example, for the results in the paper, we use baselineLD for variants and PoPS for genes." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--loading KG---\n", "--using baselineLD SNP embedding--\n", "--using random go embedding--\n", "--using PoPs expression+PPI+pathways gene embedding--\n" ] } ], "source": [ "data.load_kg(snp_init_emb = 'baselineLD', \n", " go_init_emb = 'random',\n", " gene_init_emb = 'pops')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many alternative embeddings as well. \n", "- For variant: `enformer` (default), `baselineLD`, `SLDSC`, `cadd`, `kg`, `random`\n", "- For gene: `esm` (default), `pops_expression`, `pops`, `kg`, `random`\n", "- For program/go: `random` (default), `biogpt`, `kg`\n", "\n", "In addition to more embeddings, the full data folder contains the summary statistics used in each analysis in the paper. For example, you can load the simulations via:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All required data files are present.\n", "Using simulation data....\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM ID POS A1 A2 N AF1 BETA \\\n", "0 1 rs3131962 756604 A G 4993 0.129882 14.559400 \n", "1 1 rs12562034 768448 A G 4994 0.103124 -15.034400 \n", "2 1 rs4040617 779322 G A 4979 0.127435 15.537200 \n", "3 1 rs79373928 801536 G T 4996 0.015012 16.142600 \n", "4 1 rs11240779 808631 G A 4961 0.222233 0.859838 \n", "... ... ... ... .. .. ... ... ... \n", "542753 22 rs73174435 51174939 T C 4991 0.057103 53.082400 \n", "542754 22 rs3810648 51175626 G A 4959 0.066243 17.689800 \n", "542755 22 rs5771002 51183255 A G 4937 0.334414 -12.170400 \n", "542756 22 rs3865764 51185848 G A 4984 0.050662 -43.871900 \n", "542757 22 rs142680588 51193629 G A 4994 0.073388 11.338700 \n", "\n", " SE P \n", "0 17.1871 0.396933 \n", "1 19.0234 0.429345 \n", "2 17.3933 0.371704 \n", "3 47.7752 0.735448 \n", "4 13.9158 0.950731 \n", "... ... ... \n", "542753 24.8130 0.032412 \n", "542754 23.2562 0.446867 \n", "542755 12.3314 0.323670 \n", "542756 26.3007 0.095299 \n", "542757 22.2066 0.609630 \n", "\n", "[542758 rows x 10 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.load_simulation_gwas('causal', seed = 1) # seed can range from 1-500\n", "data.lr_uni" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly for null simulations, you can load it via:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using simulation data....\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM ID POS A1 A2 N AF1 BETA \\\n", "0 1 rs3131962 756604 A G 4993 0.129882 -2.960260 \n", "1 1 rs12562034 768448 A G 4994 0.103124 -19.335700 \n", "2 1 rs4040617 779322 G A 4979 0.127435 -3.287600 \n", "3 1 rs79373928 801536 G T 4996 0.015012 -12.530000 \n", "4 1 rs11240779 808631 G A 4961 0.222233 -8.564830 \n", "... ... ... ... .. .. ... ... ... \n", "542753 22 rs73174435 51174939 T C 4991 0.057103 -24.859400 \n", "542754 22 rs3810648 51175626 G A 4959 0.066243 -0.725793 \n", "542755 22 rs5771002 51183255 A G 4937 0.334414 -5.555300 \n", "542756 22 rs3865764 51185848 G A 4984 0.050662 12.588200 \n", "542757 22 rs142680588 51193629 G A 4994 0.073388 -13.533700 \n", "\n", " SE P \n", "0 7.66276 0.699261 \n", "1 8.47710 0.022552 \n", "2 7.75475 0.671605 \n", "3 21.29860 0.556329 \n", "4 6.20273 0.167335 \n", "... ... ... \n", "542753 11.06160 0.024617 \n", "542754 10.36870 0.944195 \n", "542755 5.49753 0.312251 \n", "542756 11.72730 0.283085 \n", "542757 9.89851 0.171548 \n", "\n", "[542758 rows x 10 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.load_simulation_gwas('null', seed = 1)# seed can range from 1-500\n", "data.lr_uni" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for the subsampling analysis, you can load any trait out of the 21 subsampled traits in various sample sizes across 5 replicates. 
The phenotype list can be accessed via:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['body_BALDING1',\n", " 'disease_ALLERGY_ECZEMA_DIAGNOSED',\n", " 'disease_HYPOTHYROIDISM_SELF_REP',\n", " 'pigment_SUNBURN',\n", " '21001',\n", " '50',\n", " '30080',\n", " '30070',\n", " '30010',\n", " '30000',\n", " 'biochemistry_AlkalinePhosphatase',\n", " 'biochemistry_AspartateAminotransferase',\n", " 'biochemistry_Cholesterol',\n", " 'biochemistry_Creatinine',\n", " 'biochemistry_IGF1',\n", " 'biochemistry_Phosphate',\n", " 'biochemistry_Testosterone_Male',\n", " 'biochemistry_TotalBilirubin',\n", " 'biochemistry_TotalProtein',\n", " 'biochemistry_VitaminD',\n", " 'bmd_HEEL_TSCOREz']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.get_pheno_list()['21_indep_traits']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually each trait has the following sample sizes available: 1000, 2500, 5000, 7500, 10000, 50000, 100000, 200000. For example, to load body_BALDING1 at sample size 1000 at replicate 1, you can use:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM POS ID REF ALT A1 FIRTH? TEST OBS_CT \\\n", "0 1 756604 rs3131962 G A A Y ADD 999 \n", "1 1 768448 rs12562034 G A A Y ADD 996 \n", "2 1 779322 rs4040617 A G G Y ADD 996 \n", "3 1 801536 rs79373928 T G G Y ADD 998 \n", "4 1 808631 rs11240779 A G G Y ADD 994 \n", "... ... ... ... .. .. .. ... ... ... \n", "542753 22 51174939 rs73174435 C T T Y ADD 999 \n", "542754 22 51175626 rs3810648 A G G Y ADD 996 \n", "542755 22 51183255 rs5771002 G A A Y ADD 981 \n", "542756 22 51185848 rs3865764 A G G Y ADD 996 \n", "542757 22 51193629 rs142680588 A G G Y ADD 1000 \n", "\n", " OR LOG(OR)_SE Z_STAT P ERRCODE SNP A2 N \n", "0 1.241130 0.209870 1.029320 0.303330 . rs3131962 G 999 \n", "1 0.433894 0.285912 -2.920330 0.003497 . rs12562034 G 996 \n", "2 1.178310 0.211892 0.774379 0.438707 . rs4040617 A 996 \n", "3 0.989852 0.479159 -0.021286 0.983018 . rs79373928 T 998 \n", "4 0.880382 0.173114 -0.735930 0.461773 . rs11240779 A 994 \n", "... ... ... ... ... ... ... .. ... \n", "542753 0.642727 0.362564 -1.219190 0.222772 . rs73174435 C 999 \n", "542754 0.752885 0.286799 -0.989690 0.322326 . rs3810648 A 996 \n", "542755 0.792577 0.150356 -1.546100 0.122080 . rs5771002 G 981 \n", "542756 1.004930 0.386700 0.012715 0.989855 . rs3865764 A 996 \n", "542757 1.497360 0.267489 1.509230 0.131240 . rs142680588 A 1000 \n", "\n", "[542758 rows x 17 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.load_gwas_subsample(pheno = 'body_BALDING1', sample_size = 1000, seed = 1)\n", "data.lr_uni" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also load the full cohort GWAS for these 21 traits via:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " #CHROM ID POS A1 A2 N AF1 BETA \\\n", "0 1 rs3131962 756604 A G 407023 0.129655 0.000286 \n", "1 1 rs12562034 768448 A G 407057 0.104966 -0.001491 \n", "2 1 rs4040617 779322 G A 406623 0.127520 0.000108 \n", "3 1 rs79373928 801536 G T 407517 0.014884 0.004382 \n", "4 1 rs11240779 808631 G A 404493 0.224886 -0.001155 \n", "... ... ... ... .. .. ... ... ... \n", "542753 22 rs73174435 51174939 T C 407201 0.053846 -0.001980 \n", "542754 22 rs3810648 51175626 G A 404901 0.060979 0.001922 \n", "542755 22 rs5771002 51183255 A G 401398 0.333603 -0.000165 \n", "542756 22 rs3865764 51185848 G A 406611 0.050601 -0.001311 \n", "542757 22 rs142680588 51193629 G A 407108 0.075912 -0.002861 \n", "\n", " SE P \n", "0 0.001048 0.784760 \n", "1 0.001147 0.193592 \n", "2 0.001056 0.918404 \n", "3 0.002904 0.131349 \n", "4 0.000846 0.172345 \n", "... ... ... \n", "542753 0.001559 0.203959 \n", "542754 0.001474 0.192116 \n", "542755 0.000751 0.826494 \n", "542756 0.001605 0.413994 \n", "542757 0.001329 0.031362 \n", "\n", "[542758 rows x 10 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.load_full_gwas(pheno = 'body_BALDING1')\n", "data.lr_uni" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the basic KGWAS interface! Check out the other notebooks for other capabilities of KGWAS!" ] } ], "metadata": { "kernelspec": { "display_name": "a100_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 2 }