pyMultiOmics / Git / [7d5693] /templates/cic

Models:
AlyssaS/
pyMultiOmics
Downloads: 1
[7d5693]: / templates / cic_template.ipynb
History
Download this file
670 lines (669 with data), 18.6 kB

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "sys.path.append('..')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyMultiOmics.common import set_log_level_warning\n",
    "_ = set_log_level_warning()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyMultiOmics.loader import process_affinity_data, get_mappable_affinity_data\n",
    "from pyMultiOmics.base import SingleOmicsData, MultiOmicsData\n",
    "from pyMultiOmics.constants import CIC_COMPOUNDS, INFERENCE_T_TEST, HOMO_SAPIENS\n",
    "from pyMultiOmics.analysis import AnalysisPipeline\n",
    "from pyMultiOmics.mapping import Mapper\n",
    "from pyMultiOmics.functions import send_reactome_ora, send_reactome_expression, show_ora_pathways"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Data Pre-processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "The CiC Affinity Biomarkers data was read from the specified Excel file. To preprocess the data, analytes that fall below the Limit of Detection and appear in less than 50% of the samples in each plate were removed. Missing values for each analyte were replaced with the minimum value present. Batch correction was performed using [pyComBat](https://github.com/epigenelabs/pyComBat) to eliminate batch effects across plates. Subsequently, a logarithmic transformation was applied, followed by standardization to ensure each feature has a mean of 0 and a variance of 1. The transformed and standardized data is displayed in the heatmap below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# This cell is tagged 'parameters'\n",
    "file_name = None\n",
    "annot_csv = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(file_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filter_method = 'batch'\n",
    "filter_thresh = 0.50\n",
    "correct_batch = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data_df, sample_metadata_df, feature_metadata_df = process_affinity_data(\n",
    "    file_name, filter_method=filter_method, filter_thresh=filter_thresh, correct_batch=correct_batch, annot_csv=annot_csv\n",
    ")\n",
    "print('Number of analytes =', data_df.shape[0])\n",
    "print('Number of samples =', data_df.shape[1])\n",
    "print('Groups =', sample_metadata_df['group'].unique().tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cic_data = SingleOmicsData(CIC_COMPOUNDS, data_df, sample_metadata_df, feature_annot_df=feature_metadata_df)\n",
    "mo = MultiOmicsData()\n",
    "mo.add_data([cic_data])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "normalise = 'standard'\n",
    "log = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# This cell is tagged 'parameters'\n",
    "PCA_components = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtype = CIC_COMPOUNDS\n",
    "return_fig = True\n",
    "\n",
    "ap = AnalysisPipeline(mo, None)\n",
    "_ = ap.heatmap(dtype, normalise=normalise, log=log, return_fig=return_fig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transformed and scaled data was subjected to Principal Component Analysis using 5 components. A scatterplot displaying the projection of samples onto different components is presented below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_, design_df = ap.multi_omics_data.get_dfs(dtype)\n",
    "pcs = ap.scatter_PCA(dtype, normalise=normalise, log=log, n_components=PCA_components, hue=design_df['Plate number'], return_fig=return_fig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering Using k-means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K-means clustering is utilised to identify similar clusters, either based on samples or analytes. The best K (number of clusters) is automatically chosen by identifying turning point (the elbow) in the silhouette score plot. This allows for the identification of groups of analytes with similar intensity values, or samples that share similar characteristics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering of Samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below are the results of the k-means clustering performed on the samples.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kind = 'samples'\n",
    "k = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcs = ap.PCA(dtype, normalise=normalise, log=log, n_components=5, kind=kind)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_labels, cluster_order, centroids, silhouette_scores = ap.cluster(dtype, normalise=normalise, log=log, return_fig=return_fig, kind=kind, k=k, pcs=pcs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Principal Component Analysis (PCA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A plot of the PCA projection of samples is displayed below, with cluster labels represented by different colours based on the best number of clusters selected using k-means clustering. The samples in different groups are separated by different shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_, design_df = ap.multi_omics_data.get_dfs(dtype)\n",
    "pcs = ap.PCA(dtype, normalise=normalise, log=log, n_components=PCA_components, style=design_df['group'], hue=cluster_labels, return_fig=return_fig, kind=kind)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Heatmap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The heatmap displayed below is organised by cluster labels, showcasing the clustering patterns of the samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = ap.heatmap(dtype, normalise=normalise, log=log, return_fig=return_fig, kind=kind, cluster_order=cluster_order)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering of Analytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below are the results of the k-means clustering performed on the analytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kind = 'features'\n",
    "k = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pcs = ap.PCA(dtype, normalise=normalise, log=log, n_components=5, kind=kind)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_labels, cluster_order, centroids, silhouette_scores = ap.cluster(dtype, normalise=normalise, log=log, return_fig=return_fig, kind=kind, k=k, pcs=pcs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Principal Component Analysis (PCA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A plot of the PCA projection of analytes is displayed below, with cluster labels represented by different colours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_, design_df = ap.multi_omics_data.get_dfs(dtype)\n",
    "pcs = ap.PCA(dtype, normalise=normalise, log=log, n_components=PCA_components, hue=cluster_labels, return_fig=return_fig, kind=kind)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Heatmap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The heatmap displayed below is organized by cluster labels, showcasing the clustering patterns of the analytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = ap.heatmap(dtype, normalise=normalise, log=log, return_fig=return_fig, kind=kind, cluster_order=cluster_order)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case-vs-control Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this analysis, we perform a case-control comparison. Specifically, we conduct t-tests to compare the means of the case and control groups, as defined below. To control for multiple testing, we apply the Benjamini-Hochberg method to correct the False Discovery Rate (FDR)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Groups =', sample_metadata_df['group'].unique().tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# This cell is tagged 'parameters'\n",
    "case_group = 'disease'\n",
    "control_group = 'control'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "de_method = INFERENCE_T_TEST\n",
    "ap.run_de(de_method, dtype, case_group, control_group)\n",
    "de_df = ap.get_de_results(dtype, case_group, control_group, de_method)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Volcano Plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following Volcano Plot presents the results of T-tests, with a default p-value threshold of 0.05. Analytes were filtered using the Interquartile Range (IQR) method to remove outliers. This involved calculating the IQR as the difference between the 75th and 25th percentiles of the fold change (FC) values. Upper and lower bounds were then defined as 1.5 times the IQR above the 75th percentile and below the 25th percentile, respectively. Any analytes with FC values outside of these bounds were identified as outliers and removed from the analysis.\n",
    "\n",
    "The top-10 analytes with the largest and smallest fold changes, which were also significantly different between the case and control groups, are highlighted in the volcano plot. Additionally, the top-20 significant analytes by fold change are reported in the tables that follow the plot. This information potentially highlights interesting biomarkers associated with the case and control groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_value_colname = 'padj_%s_vs_%s' % (case_group, control_group)\n",
    "fc_colname = 'FC_%s_vs_%s' % (case_group, control_group)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# This cell is tagged 'parameters'\n",
    "p_value_thresh = 0.05\n",
    "fc_iqr_thresh = 1.5\n",
    "volcano_top_n = 10\n",
    "report_top_n = 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ap.volcano(de_df, p_value_colname, p_value_thresh, fc_colname, fc_iqr_thresh=fc_iqr_thresh, top_n=volcano_top_n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Significantly-changing Analytes Ordered by Fold Changes (Descending)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a list of analytes that have shown significant changes in abundance. The list is sorted in descending order based on their fold changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fc_sort_order = 'desc'\n",
    "sorted_df_asc = ap.de_sort_and_filter(de_df, p_value_colname, p_value_thresh, fc_colname, \n",
    "                                      fc_sort_order=fc_sort_order, top_n=report_top_n, fc_iqr_thresh=fc_iqr_thresh)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_df = pd.merge(sorted_df_asc, cic_data.feature_annot_df[['Analyte Class', 'Analyte Name']], left_index=True, right_index=True, how='left')\n",
    "merged_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Significantly-changing Analytes Ordered by Fold Changes (Ascending)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a list of analytes that have shown significant changes in abundance. The list is sorted in ascending order based on their fold changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fc_sort_order = 'asc'\n",
    "sorted_df_desc = ap.de_sort_and_filter(de_df, p_value_colname, p_value_thresh, fc_colname, \n",
    "                                      fc_sort_order=fc_sort_order, top_n=report_top_n, fc_iqr_thresh=fc_iqr_thresh)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_df = pd.merge(sorted_df_desc, cic_data.feature_annot_df[['Analyte Class', 'Analyte Name']], left_index=True, right_index=True, how='left')\n",
    "merged_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Significant Pathway Analysis"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "By utilising the provided ChEBI ID annotations, we can map annotated compounds to Reactome pathways. To increase the mapping coverage, we have included related ChEBI IDs such as the conjugate base, acid, and tautomer of the provided IDs. Using the mapped and differentially expressed compounds across the case vs control groups, we performed an over-representation analysis using the Reactome Analysis Service to identify pathways with an enrichment of the mapped compounds. This analysis provides valuable insight into the biological processes associated with these compounds. The top-20 pathways are shown in the results below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analysis_species = HOMO_SAPIENS\n",
    "de_only = True # TODO: don't forget to change this to True for real data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "new_mo = get_mappable_affinity_data(mo, 'ChEBI')\n",
    "m = Mapper(new_mo, analysis_species, metabolic_pathway_only=True, include_related_chebi=True)\n",
    "_ = m.build()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ora_df = send_reactome_ora(ap, m, de_df, p_value_colname, p_value_thresh, fc_colname, fc_iqr_thresh, analysis_species, de_only=de_only)\n",
    "ora_df[ora_df['entities_pValue'] < p_value_thresh].head(20)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Expression data of mapped compounds was sent to Reactome and mapped onto Reactome pathway diagrams. The expression data was in the form of log fold change of compounds, which was scaled between -1 and 1 for clear visualization on the diagrams. Below are the pathway diagrams for the top 5 most significant pathways, ranked by their p-values from ORA analysis. Note that all compounds and their expression levels are included in the diagram, regardless of whether they are differentially expressed or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "normalise = True\n",
    "expression_df, token = send_reactome_expression(ap, m, de_df, p_value_colname, fc_colname, analysis_species, case_group, control_group, normalise=normalise)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "show_ora_pathways(ora_df, token, N=5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}