1021 lines (1020 with data), 26.0 kB
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%run notebook_setup.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Imported `literature` (904B0F94) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import literature",
"finished": "2020-08-08T05:59:00.149124",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "904B0F94",
"sha256": "A2EFC068A287A3B724AE4B320EE5356E1E99474BD08A2E2A3EBA34CD0194F23B"
},
"subject": "literature"
}
],
"started": "2020-08-08T05:58:58.159275"
}
},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Imported:\n",
"\n",
" - `predicted_article_types` (3D39430E)\n",
" - `reliable_article_types` (5D584CB5)\n",
"\n",
"at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import predicted_article_types, reliable_article_types",
"finished": "2020-08-08T05:59:01.134868",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "3D39430E",
"sha256": "C434CF669D09A80085574C5EAF7D4B6154FF04EC1A2143DA15E42E464E3314E9"
},
"subject": "predicted_article_types"
},
{
"new_file": {
"crc32": "5D584CB5",
"sha256": "585366F3E5A11FC007CC4DFF5AF9C7AFBCBEBA3A15B65333657C632F2218A1AC"
},
"subject": "reliable_article_types"
}
],
"started": "2020-08-08T05:59:00.152647"
}
},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Imported `domain_features` (9CBD2CED) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import domain_features",
"finished": "2020-08-08T05:59:01.670222",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "9CBD2CED",
"sha256": "69E41B5E85F3320A8BED275B947ECA40F456F11EC6734F3E3BCDE4BD64EA9255"
},
"subject": "domain_features"
}
],
"started": "2020-08-08T05:59:01.138425"
}
},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Imported `popular_journals` (0B2CABD1) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import popular_journals",
"finished": "2020-08-08T05:59:02.177712",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "0B2CABD1",
"sha256": "90D36B3DA0AF97C85591B7E55E1298A1498C6504032163879A08F825EADC3164"
},
"subject": "popular_journals"
}
],
"started": "2020-08-08T05:59:01.673901"
}
},
"output_type": "display_data"
}
],
"source": [
"%vault from pubmed_derived_data import literature\n",
"%vault from pubmed_derived_data import predicted_article_types, reliable_article_types\n",
"%vault from pubmed_derived_data import domain_features\n",
"%vault from pubmed_derived_data import popular_journals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aim**:\n",
"- verify if TCGA is indeed over-represented in methods papers (and by how much)\n",
"- collect the disease terms and create an ontology plot to highlight which kind of diseases are well-studied and which are not)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"textual = literature['title'] + ' ' + literature['abstract_clean'].fillna('') + ' ' + literature['full_text'].fillna('')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.11805555555555555"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"literature['mentions_tcga'] = (\n",
" textual\n",
" .str.lower().str.contains('tcga|the cancer genome atlas')\n",
")\n",
"literature['mentions_tcga'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from pandas import concat"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"combined_article_types = concat([\n",
" predicted_article_types,\n",
" reliable_article_types\n",
"]).loc[literature.index]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"data = (\n",
" literature\n",
" .drop(columns=['full_text', 'abstract'])\n",
" .join(combined_article_types)\n",
")\n",
"data['is_type_predicted'] = data.index.isin(predicted_article_types.index)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"all_articles = data.assign(one=1)\n",
"open_access_subset = all_articles[all_articles.has_full_text == True]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import fisher_exact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cancer enrichment in multi-omics papers (compared to matched papers from same context)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TIAB is PubMed code for 'title and abstract' search restriction; here we use start with all the articles published in journals of the it is used to match the feature extraction performed on abstracts of articles:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Imported `cancer_articles_from_popular_journals_tiab_only` (C6D2493E) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import cancer_articles_from_popular_journals_tiab_only",
"finished": "2020-08-08T05:59:03.425039",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "C6D2493E",
"sha256": "F0C0D1C024BD2CED3E45832958994F88EAB809CDFFAC97C732126B08B87B2C64"
},
"subject": "cancer_articles_from_popular_journals_tiab_only"
}
],
"started": "2020-08-08T05:59:02.895877"
}
},
"output_type": "display_data"
}
],
"source": [
"%vault from pubmed_derived_data import cancer_articles_from_popular_journals_tiab_only"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Imported `all_articles_by_journal_and_year` (AB6E261E) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import all_articles_by_journal_and_year",
"finished": "2020-08-08T05:59:03.992265",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "AB6E261E",
"sha256": "343D4005442B93F41397AF04892D839174F38A2128ED5A08201A581D7FAF0201"
},
"subject": "all_articles_by_journal_and_year"
}
],
"started": "2020-08-08T05:59:03.460436"
}
},
"output_type": "display_data"
}
],
"source": [
"%vault from pubmed_derived_data import all_articles_by_journal_and_year"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def count_articles_mentioning_disease(data):\n",
" return (\n",
" Series(\n",
" data\n",
" .mentioned_diseases_set\n",
" .astype(object).apply(eval).apply(list)\n",
" .sum()\n",
" )\n",
" .value_counts()\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"cancer 786\n",
"disease 722\n",
"carcinoma 132\n",
"inflammation 77\n",
"cardiovascular 68\n",
"diabetes 60\n",
"colorectal cancer 59\n",
"adenocarcinoma 53\n",
"hepatocellular carcinoma 47\n",
"glioblastoma 42\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features)\n",
"number_of_articles_mentioning_diseases.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"index\n",
"Scientific reports 0.048592\n",
"Omics : a journal of integrative biology 0.030081\n",
"Name: share, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"journal_share_in_multiomics = popular_journals.journal / sum(popular_journals.journal)\n",
"journal_share_in_multiomics.name = 'share'\n",
"journal_share_in_multiomics.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def counts_weighted_by_share(data, share):\n",
" with_share = data.groupby('journal').sum().join(share)\n",
" return (with_share['count'] * with_share['share']).sum()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.22743055555555555"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_tiab_only, journal_share_in_multiomics)\n",
"all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, journal_share_in_multiomics)\n",
"\n",
"cancer_articles_in_multi_omics = number_of_articles_mentioning_diseases.loc['cancer']\n",
"articles_in_multi_omics = len(domain_features)\n",
"\n",
"cancer_articles_in_multi_omics / articles_in_multi_omics"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.09978695663666994"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_articles_weighted / all_articles_weighted"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.07484122553416218"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(cancer_articles_weighted - cancer_articles_in_multi_omics) / (all_articles_weighted - articles_in_multi_omics)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[786, 1323], [3456, 17683]]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_in_multiple_vs_other = [\n",
" [cancer_articles_in_multi_omics, int(cancer_articles_weighted) - cancer_articles_in_multi_omics],\n",
" [articles_in_multi_omics, int(all_articles_weighted) - articles_in_multi_omics]\n",
"]\n",
"cancer_in_multiple_vs_other"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.0397993302259176, 3.179123156040738e-105)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fisher_exact(cancer_in_multiple_vs_other)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2.2795896225172543, 3.1306778185794075e-66)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_in_multiple_vs_all = [\n",
" [cancer_articles_in_multi_omics, int(cancer_articles_weighted)],\n",
" [articles_in_multi_omics, int(all_articles_weighted)]\n",
"]\n",
"fisher_exact(cancer_in_multiple_vs_all)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note: this is not as strong without weighting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which is not surprising, given that journals are not focusing on specific topics, including cancer. Journal publishing a lot of cancer research which has published 3 multi-omics articles would be then counted in as much as \"Omics\", \"Bioinformatics\", even though the latter are where the majority of the multi-omics articles get published."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.11564909586403536"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_articles_crude = cancer_articles_from_popular_journals_tiab_only['count'].sum()\n",
"all_articles_crude = all_articles_by_journal_and_year['count'].sum()\n",
"\n",
"cancer_articles_crude / all_articles_crude"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1.9692534874417003, 3.9425319401519796e-57)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fisher_exact([\n",
" [cancer_articles_in_multi_omics, cancer_articles_crude - cancer_articles_in_multi_omics],\n",
" [articles_in_multi_omics, all_articles_crude - articles_in_multi_omics]\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Diligence check: would it hold if we looked at the full-text articles only?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, but the effect size is lower (higher p-value is expected also because we look at a subset)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Imported `cancer_articles_from_popular_journals_any_field` (6931F0FF) at Saturday, 08. Aug 2020 05:59"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {
"text/markdown": {
"action": "import",
"command": "from pubmed_derived_data import cancer_articles_from_popular_journals_any_field",
"finished": "2020-08-08T05:59:05.131286",
"finished_human_readable": "Saturday, 08. Aug 2020 05:59",
"result": [
{
"new_file": {
"crc32": "6931F0FF",
"sha256": "D891354ECC232F9BDC07328CDBE8707ECE13127B0850FB3C67CA065D49D34C34"
},
"subject": "cancer_articles_from_popular_journals_any_field"
}
],
"started": "2020-08-08T05:59:04.585791"
}
},
"output_type": "display_data"
}
],
"source": [
"%vault from pubmed_derived_data import cancer_articles_from_popular_journals_any_field"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open_acess_journal_freq = open_access_subset.journal.sorted_value_counts()\n",
"oa_popular_journals = open_acess_journal_freq[open_acess_journal_freq >= 3]\n",
"oa_popular_journals.sum() / oa_popular_journals.sum()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"index\n",
"Scientific reports 0.102310\n",
"PloS one 0.056106\n",
"Name: share, dtype: float64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"oa_journal_share_in_multiomics = oa_popular_journals / sum(oa_popular_journals)\n",
"oa_journal_share_in_multiomics.name = 'share'\n",
"oa_journal_share_in_multiomics.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.2565789473684211, 0.1283517981102067)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"oa_cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_any_field, oa_journal_share_in_multiomics)\n",
"oa_all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, oa_journal_share_in_multiomics)\n",
"\n",
"oa_number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features.loc[open_access_subset.index])\n",
"oa_cancer_articles_in_multi_omics = oa_number_of_articles_mentioning_diseases.loc['cancer']\n",
"oa_articles_in_multi_omics = len(open_access_subset)\n",
"\n",
"oa_cancer_articles_in_multi_omics / oa_articles_in_multi_omics, oa_cancer_articles_weighted / oa_all_articles_weighted"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2.0987110633727175, 9.348194227640988e-32)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fisher_exact([\n",
" [oa_cancer_articles_in_multi_omics, oa_cancer_articles_weighted - oa_cancer_articles_in_multi_omics],\n",
" [oa_articles_in_multi_omics, oa_all_articles_weighted - oa_articles_in_multi_omics]\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TCGA enrichment in computational method papers (compared to other types)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[32, 34], [287, 1167]]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"oa_tcga_mentions_vs_method = [\n",
" [open_access_subset.query('is_method and mentions_tcga').one.sum(), open_access_subset.query('is_method and not mentions_tcga').one.sum()],\n",
" [open_access_subset.query('not is_method and mentions_tcga').one.sum(), open_access_subset.query('not is_method and not mentions_tcga').one.sum()]\n",
"]\n",
"oa_tcga_mentions_vs_method"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.827013732322197, 4.452431104649725e-07)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fisher_exact(oa_tcga_mentions_vs_method)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.19738651994497936"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open_access_subset.query('not is_method').mentions_tcga.mean()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.48484848484848486"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open_access_subset.query('is_method').mentions_tcga.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Diligence check: does it hold on the manually verified methods?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Yes, because all full-text method articles were verified/no new methods were predicted from open-access subset)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.19738651994497936"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open_access_subset.query('not is_method and (not is_type_predicted)').mentions_tcga.mean()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.48484848484848486"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open_access_subset.query('is_method and (not is_type_predicted)').mentions_tcga.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Diligence check: does it hold on the larger superset (for articles with no full text)?"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.1094692400482509"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_articles.query('not is_method').mentions_tcga.mean()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.32142857142857145"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_articles.query('is_method').mentions_tcga.mean()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.8534145280556764, 5.318078390481294e-11)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fisher_exact(\n",
" [\n",
" [all_articles.query('is_method and mentions_tcga').one.sum(), all_articles.query('is_method and not mentions_tcga').one.sum()],\n",
" [all_articles.query('not is_method and mentions_tcga').one.sum(), all_articles.query('not is_method and not mentions_tcga').one.sum()]\n",
" ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, and the effect-size even larger and p-value lower! But the we should report the more conservative finding from the open-access subset, because:\n",
"\n",
"- I would not expect computational method papers to announce that they use TCGA data in abstract - they will keep that as a detail in methods\n",
" - thus the open-access subset should provides more accurate representation\n",
"- All the computational methods articles in the open-access subset come from manual curation and not prediction"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}