{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run notebook_setup.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Imported `literature` (904B0F94) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import literature",
       "finished": "2020-08-08T05:59:00.149124",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "904B0F94",
          "sha256": "A2EFC068A287A3B724AE4B320EE5356E1E99474BD08A2E2A3EBA34CD0194F23B"
         },
         "subject": "literature"
        }
       ],
       "started": "2020-08-08T05:58:58.159275"
      }
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "Imported:\n",
       "\n",
       " - `predicted_article_types` (3D39430E)\n",
       " - `reliable_article_types` (5D584CB5)\n",
       "\n",
       "at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import predicted_article_types, reliable_article_types",
       "finished": "2020-08-08T05:59:01.134868",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "3D39430E",
          "sha256": "C434CF669D09A80085574C5EAF7D4B6154FF04EC1A2143DA15E42E464E3314E9"
         },
         "subject": "predicted_article_types"
        },
        {
         "new_file": {
          "crc32": "5D584CB5",
          "sha256": "585366F3E5A11FC007CC4DFF5AF9C7AFBCBEBA3A15B65333657C632F2218A1AC"
         },
         "subject": "reliable_article_types"
        }
       ],
       "started": "2020-08-08T05:59:00.152647"
      }
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "Imported `domain_features` (9CBD2CED) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import domain_features",
       "finished": "2020-08-08T05:59:01.670222",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "9CBD2CED",
          "sha256": "69E41B5E85F3320A8BED275B947ECA40F456F11EC6734F3E3BCDE4BD64EA9255"
         },
         "subject": "domain_features"
        }
       ],
       "started": "2020-08-08T05:59:01.138425"
      }
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "Imported `popular_journals` (0B2CABD1) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import popular_journals",
       "finished": "2020-08-08T05:59:02.177712",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "0B2CABD1",
          "sha256": "90D36B3DA0AF97C85591B7E55E1298A1498C6504032163879A08F825EADC3164"
         },
         "subject": "popular_journals"
        }
       ],
       "started": "2020-08-08T05:59:01.673901"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%vault from pubmed_derived_data import literature\n",
    "%vault from pubmed_derived_data import predicted_article_types, reliable_article_types\n",
    "%vault from pubmed_derived_data import domain_features\n",
    "%vault from pubmed_derived_data import popular_journals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Aim**:\n",
    "- verify whether TCGA is indeed over-represented in methods papers (and by how much)\n",
    "- collect the disease terms and create an ontology plot to highlight which kinds of diseases are well-studied and which are not"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "textual = literature['title'] + ' ' + literature['abstract_clean'].fillna('') + ' ' + literature['full_text'].fillna('')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.11805555555555555"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "literature['mentions_tcga'] = (\n",
    "    textual\n",
    "    .str.lower().str.contains('tcga|the cancer genome atlas')\n",
    ")\n",
    "literature['mentions_tcga'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import concat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_article_types = concat([\n",
    "    predicted_article_types,\n",
    "    reliable_article_types\n",
    "]).loc[literature.index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = (\n",
    "    literature\n",
    "    .drop(columns=['full_text', 'abstract'])\n",
    "    .join(combined_article_types)\n",
    ")\n",
    "data['is_type_predicted'] = data.index.isin(predicted_article_types.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_articles = data.assign(one=1)\n",
    "open_access_subset = all_articles[all_articles.has_full_text == True]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import fisher_exact"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cancer enrichment in multi-omics papers (compared to matched papers from same context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TIAB is the PubMed field tag for the 'title and abstract' search restriction. Here we start with the articles published in the popular journals; the TIAB restriction is used to match the feature extraction, which was performed on the abstracts of articles:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Imported `cancer_articles_from_popular_journals_tiab_only` (C6D2493E) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import cancer_articles_from_popular_journals_tiab_only",
       "finished": "2020-08-08T05:59:03.425039",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "C6D2493E",
          "sha256": "F0C0D1C024BD2CED3E45832958994F88EAB809CDFFAC97C732126B08B87B2C64"
         },
         "subject": "cancer_articles_from_popular_journals_tiab_only"
        }
       ],
       "started": "2020-08-08T05:59:02.895877"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%vault from pubmed_derived_data import cancer_articles_from_popular_journals_tiab_only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Imported `all_articles_by_journal_and_year` (AB6E261E) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import all_articles_by_journal_and_year",
       "finished": "2020-08-08T05:59:03.992265",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "AB6E261E",
          "sha256": "343D4005442B93F41397AF04892D839174F38A2128ED5A08201A581D7FAF0201"
         },
         "subject": "all_articles_by_journal_and_year"
        }
       ],
       "started": "2020-08-08T05:59:03.460436"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%vault from pubmed_derived_data import all_articles_by_journal_and_year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_articles_mentioning_disease(data):\n",
    "    return (\n",
    "        Series(\n",
    "            data\n",
    "            .mentioned_diseases_set\n",
    "            .astype(object).apply(eval).apply(list)\n",
    "            .sum()\n",
    "        )\n",
    "        .value_counts()\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "cancer                      786\n",
       "disease                     722\n",
       "carcinoma                   132\n",
       "inflammation                 77\n",
       "cardiovascular               68\n",
       "diabetes                     60\n",
       "colorectal cancer            59\n",
       "adenocarcinoma               53\n",
       "hepatocellular carcinoma     47\n",
       "glioblastoma                 42\n",
       "dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features)\n",
    "number_of_articles_mentioning_diseases.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "index\n",
       "Scientific reports                          0.048592\n",
       "Omics : a journal of integrative biology    0.030081\n",
       "Name: share, dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "journal_share_in_multiomics = popular_journals.journal / sum(popular_journals.journal)\n",
    "journal_share_in_multiomics.name = 'share'\n",
    "journal_share_in_multiomics.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def counts_weighted_by_share(data, share):\n",
    "    with_share = data.groupby('journal').sum().join(share)\n",
    "    return (with_share['count'] * with_share['share']).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.22743055555555555"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_tiab_only, journal_share_in_multiomics)\n",
    "all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, journal_share_in_multiomics)\n",
    "\n",
    "cancer_articles_in_multi_omics = number_of_articles_mentioning_diseases.loc['cancer']\n",
    "articles_in_multi_omics = len(domain_features)\n",
    "\n",
    "cancer_articles_in_multi_omics / articles_in_multi_omics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.09978695663666994"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_articles_weighted / all_articles_weighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.07484122553416218"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(cancer_articles_weighted - cancer_articles_in_multi_omics) / (all_articles_weighted - articles_in_multi_omics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[786, 1323], [3456, 17683]]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_in_multiple_vs_other = [\n",
    "    [cancer_articles_in_multi_omics, int(cancer_articles_weighted) - cancer_articles_in_multi_omics],\n",
    "    [articles_in_multi_omics, int(all_articles_weighted) - articles_in_multi_omics]\n",
    "]\n",
    "cancer_in_multiple_vs_other"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.0397993302259176, 3.179123156040738e-105)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fisher_exact(cancer_in_multiple_vs_other)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2.2795896225172543, 3.1306778185794075e-66)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_in_multiple_vs_all = [\n",
    "    [cancer_articles_in_multi_omics, int(cancer_articles_weighted)],\n",
    "    [articles_in_multi_omics, int(all_articles_weighted)]\n",
    "]\n",
    "fisher_exact(cancer_in_multiple_vs_all)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note: this is not as strong without weighting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is not surprising, given that some journals focus on specific topics, including cancer. Without weighting, a journal that publishes a lot of cancer research but only 3 multi-omics articles would count as much as \"Omics\" or \"Bioinformatics\", even though the latter are where the majority of the multi-omics articles get published."
   ]
  },
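  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of weighting can be illustrated with a toy example. All journal names and counts below are hypothetical (they are not taken from the datasets used in this notebook); the weighted sums mirror what `counts_weighted_by_share` computes, but use plain dictionaries instead of data frames:\n",
    "\n",
    "```python\n",
    "# Hypothetical numbers of multi-omics articles per journal (made up for illustration)\n",
    "multi_omics_articles = {'Cancer-focused journal': 3, 'Omics-focused journal': 97}\n",
    "total = sum(multi_omics_articles.values())\n",
    "# Share of each journal in the multi-omics literature\n",
    "share = {journal: n / total for journal, n in multi_omics_articles.items()}\n",
    "\n",
    "# Hypothetical counts of all articles and of cancer articles in each journal\n",
    "all_counts = {'Cancer-focused journal': 10000, 'Omics-focused journal': 10000}\n",
    "cancer_counts = {'Cancer-focused journal': 9000, 'Omics-focused journal': 1000}\n",
    "\n",
    "\n",
    "def weighted_sum(counts, share):\n",
    "    # Sum of per-journal counts, each multiplied by the share of the journal\n",
    "    return sum(counts[journal] * share[journal] for journal in counts)\n",
    "\n",
    "\n",
    "crude_fraction = sum(cancer_counts.values()) / sum(all_counts.values())\n",
    "weighted_fraction = weighted_sum(cancer_counts, share) / weighted_sum(all_counts, share)\n",
    "print(crude_fraction, weighted_fraction)\n",
    "```\n",
    "\n",
    "With these made-up numbers the crude background cancer fraction is 0.5, while the weighted one is about 0.124, because the cancer-focused journal contributes only 3% of the multi-omics articles."
   ]
  },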
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.11564909586403536"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_articles_crude = cancer_articles_from_popular_journals_tiab_only['count'].sum()\n",
    "all_articles_crude = all_articles_by_journal_and_year['count'].sum()\n",
    "\n",
    "cancer_articles_crude / all_articles_crude"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1.9692534874417003, 3.9425319401519796e-57)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fisher_exact([\n",
    "    [cancer_articles_in_multi_omics, cancer_articles_crude - cancer_articles_in_multi_omics],\n",
    "    [articles_in_multi_omics, all_articles_crude - articles_in_multi_omics]\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Diligence check: would it hold if we looked at the full-text articles only?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes, but the effect size is lower (a higher p-value is also expected because we look at a subset)."
   ]
  },
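  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why a subset is expected to give a higher p-value can be illustrated with a small sketch: two hypothetical 2x2 tables with the same odds ratio but different sizes. The tables are made up, and the helper `one_sided_fisher_p` (an illustrative name, not part of this analysis) computes a one-sided hypergeometric tail probability, whereas `fisher_exact` as called in this notebook uses its default two-sided alternative, so the values are not directly comparable with the outputs here:\n",
    "\n",
    "```python\n",
    "from math import comb\n",
    "\n",
    "\n",
    "def one_sided_fisher_p(table):\n",
    "    # Probability of observing at least `a` in the top-left cell,\n",
    "    # given fixed row and column totals (hypergeometric upper tail)\n",
    "    (a, b), (c, d) = table\n",
    "    row_1, row_2, col_1 = a + b, c + d, a + c\n",
    "    denominator = comb(row_1 + row_2, col_1)\n",
    "    return sum(\n",
    "        comb(row_1, k) * comb(row_2, col_1 - k)\n",
    "        for k in range(a, min(row_1, col_1) + 1)\n",
    "    ) / denominator\n",
    "\n",
    "\n",
    "def sample_odds_ratio(table):\n",
    "    (a, b), (c, d) = table\n",
    "    return (a * d) / (b * c)\n",
    "\n",
    "\n",
    "# Hypothetical tables: the second is the first with every cell divided by four\n",
    "large = [[40, 60], [20, 80]]\n",
    "small = [[10, 15], [5, 20]]\n",
    "\n",
    "print(sample_odds_ratio(large), sample_odds_ratio(small))\n",
    "print(one_sided_fisher_p(large), one_sided_fisher_p(small))\n",
    "```\n",
    "\n",
    "Both tables have the same sample odds ratio (8/3), but the smaller table yields the larger p-value."
   ]
  },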
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Imported `cancer_articles_from_popular_journals_any_field` (6931F0FF) at Saturday, 08. Aug 2020 05:59"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {
      "text/markdown": {
       "action": "import",
       "command": "from pubmed_derived_data import cancer_articles_from_popular_journals_any_field",
       "finished": "2020-08-08T05:59:05.131286",
       "finished_human_readable": "Saturday, 08. Aug 2020 05:59",
       "result": [
        {
         "new_file": {
          "crc32": "6931F0FF",
          "sha256": "D891354ECC232F9BDC07328CDBE8707ECE13127B0850FB3C67CA065D49D34C34"
         },
         "subject": "cancer_articles_from_popular_journals_any_field"
        }
       ],
       "started": "2020-08-08T05:59:04.585791"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%vault from pubmed_derived_data import cancer_articles_from_popular_journals_any_field"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "open_access_journal_freq = open_access_subset.journal.sorted_value_counts()\n",
    "oa_popular_journals = open_access_journal_freq[open_access_journal_freq >= 3]\n",
    "# note: as written, this ratio divides the sum by itself and is 1.0 by construction\n",
    "oa_popular_journals.sum() / oa_popular_journals.sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "index\n",
       "Scientific reports    0.102310\n",
       "PloS one              0.056106\n",
       "Name: share, dtype: float64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "oa_journal_share_in_multiomics = oa_popular_journals / sum(oa_popular_journals)\n",
    "oa_journal_share_in_multiomics.name = 'share'\n",
    "oa_journal_share_in_multiomics.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.2565789473684211, 0.1283517981102067)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "oa_cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_any_field, oa_journal_share_in_multiomics)\n",
    "oa_all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, oa_journal_share_in_multiomics)\n",
    "\n",
    "oa_number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features.loc[open_access_subset.index])\n",
    "oa_cancer_articles_in_multi_omics = oa_number_of_articles_mentioning_diseases.loc['cancer']\n",
    "oa_articles_in_multi_omics = len(open_access_subset)\n",
    "\n",
    "oa_cancer_articles_in_multi_omics / oa_articles_in_multi_omics, oa_cancer_articles_weighted / oa_all_articles_weighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2.0987110633727175, 9.348194227640988e-32)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fisher_exact([\n",
    "    [oa_cancer_articles_in_multi_omics, oa_cancer_articles_weighted - oa_cancer_articles_in_multi_omics],\n",
    "    [oa_articles_in_multi_omics, oa_all_articles_weighted - oa_articles_in_multi_omics]\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TCGA enrichment in computational method papers (compared to other types)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[32, 34], [287, 1167]]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "oa_tcga_mentions_vs_method = [\n",
    "    [open_access_subset.query('is_method and mentions_tcga').one.sum(), open_access_subset.query('is_method and not mentions_tcga').one.sum()],\n",
    "    [open_access_subset.query('not is_method and mentions_tcga').one.sum(), open_access_subset.query('not is_method and not mentions_tcga').one.sum()]\n",
    "]\n",
    "oa_tcga_mentions_vs_method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.827013732322197, 4.452431104649725e-07)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fisher_exact(oa_tcga_mentions_vs_method)"
   ]
  },
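  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the first value returned by `fisher_exact` above (the sample odds ratio) can be reproduced by hand from the contingency table printed earlier. A minimal sketch, assuming the table layout `[[method & TCGA, method & no TCGA], [other & TCGA, other & no TCGA]]` as constructed above:\n",
    "\n",
    "```python\n",
    "# Counts copied from the `oa_tcga_mentions_vs_method` output above\n",
    "(a, b), (c, d) = [[32, 34], [287, 1167]]\n",
    "\n",
    "# Sample odds ratio: odds of mentioning TCGA among method articles\n",
    "# divided by the odds of mentioning TCGA among all other articles\n",
    "odds_ratio = (a * d) / (b * c)\n",
    "\n",
    "# Fraction of articles mentioning TCGA in each group\n",
    "share_in_methods = a / (a + b)\n",
    "share_in_others = c / (c + d)\n",
    "print(odds_ratio, share_in_methods, share_in_others)\n",
    "```\n",
    "\n",
    "The odds ratio is about 3.83; roughly 48% of the method articles and 20% of the other articles in the open-access subset mention TCGA."
   ]
  },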
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.19738651994497936"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "open_access_subset.query('not is_method').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.48484848484848486"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "open_access_subset.query('is_method').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Diligence check: does it hold on the manually verified methods?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes, because all full-text method articles were manually verified; no new methods were predicted from the open-access subset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.19738651994497936"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "open_access_subset.query('not is_method and (not is_type_predicted)').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.48484848484848486"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "open_access_subset.query('is_method and (not is_type_predicted)').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Diligence check: does it hold on the larger superset (including articles with no full text)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1094692400482509"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_articles.query('not is_method').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.32142857142857145"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_articles.query('is_method').mentions_tcga.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3.8534145280556764, 5.318078390481294e-11)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fisher_exact(\n",
    "    [\n",
    "        [all_articles.query('is_method and mentions_tcga').one.sum(), all_articles.query('is_method and not mentions_tcga').one.sum()],\n",
    "        [all_articles.query('not is_method and mentions_tcga').one.sum(), all_articles.query('not is_method and not mentions_tcga').one.sum()]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes, and the effect size is even larger and the p-value lower! But we should report the more conservative finding from the open-access subset, because:\n",
    "\n",
    "- I would not expect computational method papers to announce in the abstract that they use TCGA data; they will keep that as a detail in the methods\n",
    "  - thus the open-access subset should provide a more accurate representation\n",
    "- All the computational method articles in the open-access subset come from manual curation and not prediction"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}