[427757]: / R-scripts / scripts / Comparison_in_cancer.ipynb

Download this file

1012 lines (1011 with data), 40.3 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SUB-BENCHMARK 2: Comparison of the jDR methos on cancer datasets\n",
    "\n",
    "We here reproduce the second sub-benchmark proposed in the paper. We thus compare here the performances of the various jDR methods on the integration of multi-omics cancer datasets from TCGA. The methods are evaluated regarding associations to clinical annotations, survival and biological annotations (GO, REACTOME, Hallmarks)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison based on clinical annotations 基于临床注释的比较\n",
    "\n",
    "We define here a function that performs the jDR comparison based on their association to clinical annotations. In particular, we here compute the number of clinical annotations significantly associated with at least one factor, and their selectivity (see Methods paper). The clinical metadata need to be stored in a clinical folder containing files named according to each cancer type. \n",
    "我们在这里定义了一个函数,它根据jDR与临床注释的关联来执行jDR比较。特别是,我们在这里计算与至少一个因素显著相关的临床注释的数量,以及它们的选择性(见方法论文)。临床元数据需要存储在临床文件夹中,其中包含根据每种癌症类型命名的文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Perform clinical annotation-based comparison \n",
    "## INPUTS:\n",
    "# factorizations = already computed factirizations\n",
    "# clinical = clinical information associated with samples\n",
    "# col = columns in clinical data on which the analysis will be performed\n",
    "## OUPUTS: a list containing output values\n",
    "# selectivity = Selectivity (fraction of significant annotations per all significant factors)\n",
    "# nonZeroFacs = Number of unique factors that were significant at least once\n",
    "# total_pathways = Number of clinical annotations with at least one significant factor component\n",
    "##执行基于临床注释的比较\n",
    "##输入:\n",
    "#factorizations=已计算的因式分解\n",
    "#clinical=与样本相关的临床信息\n",
    "#col=将对其执行分析的临床数据列\n",
    "##输出:包含输出值的列表\n",
    "#selectivity=选择性(每个重要因素的重要注释分数)\n",
    "#nonZeroFacs=至少一次显著的唯一因素的数量\n",
    "#total_pathways=至少有一个显著因子成分的临床注释数\n",
    "clinical_comparison <- function(factorizations, clinical, col){\n",
    "\n",
    "    # Empty containers\n",
    "    line <- numeric(0)\n",
    "    line2 <- numeric(0)\n",
    "    line3 <- numeric(0)\n",
    "\n",
    "    # For each factorization\n",
    "    for(i in 1:length(factorizations)){\n",
    "\n",
    "        # Extract sample association\n",
    "        factors <- factorizations[[i]][[1]]\n",
    "        if(is.null(names(factors))) { names(factors) <- 1:ncol(factors) }\n",
    "        if(is.null(colnames(factors))) { colnames(factors) <- 1:ncol(factors) }\n",
    "\n",
    "        # Patient names in factorisation results\n",
    "        patient.names <- rownames(factors)\n",
    "        # Patient names in original data\n",
    "        patient.names.in.file <- as.character(clinical[, 1])\n",
    "        patient.names.in.file <- toupper(gsub('-', '\\\\.', patient.names.in.file))\n",
    "        # Remove non-matching patient names\n",
    "        is_in_file <- patient.names %in% patient.names.in.file\n",
    "        if(length(patient.names)!=sum(is_in_file)) {\n",
    "            factors <- factors[is_in_file, ]\n",
    "            patient.names <- patient.names[is_in_file]\n",
    "            rownames(factors)<-patient.names\n",
    "        }\n",
    "        # Match indices of patient names\n",
    "        indices <- match(patient.names, patient.names.in.file)\n",
    "        # Use indices to extract coresponding survival information\n",
    "        ordered.clinical.data <- clinical[indices,]        \n",
    "        \n",
    "        # Only use column names that are really present in the data\n",
    "        col_new <- col[col %in% colnames(ordered.clinical.data)]\n",
    "\n",
    "        # Stor all p-values\n",
    "        pvalues <- numeric(0)\n",
    "        # Store number of significant annotations\n",
    "        clin_erich <- 0 \n",
    "        \n",
    "        # Test significance association with clinical annotations\n",
    "        for(j in col_new){\n",
    "            \n",
    "            # Perform the analysis if there is more than one possible value in current column\n",
    "            table_values <- table(ordered.clinical.data[,j])\n",
    "            if(sum(table_values>0)>1){\n",
    "\n",
    "                if(j == \"age_at_initial_pathologic_diagnosis\" ){\n",
    "                    pvalues_col <- apply(factors, MARGIN=2, \n",
    "                                       function(x) kruskal.test(x~cut(as.numeric(ordered.clinical.data[,j]),\n",
    "                                                                      5, include.lowest=TRUE))$p.value)\n",
    "                    pvalues <- c(pvalues, pvalues_col)\n",
    "                    if(min(pvalues_col)<0.05){\n",
    "                        clin_erich <- clin_erich+1\n",
    "                    }\n",
    "                }else if(j == \"days_to_new_tumor_event_after_initial_treatment\"){\n",
    "                    pvalues_col <- apply(factors,MARGIN=2, \n",
    "                                       function(x) kruskal.test(x~cut(as.numeric(ordered.clinical.data[,j]), \n",
    "                                                                      3, include.lowest=TRUE))$p.value)\n",
    "                    pvalues <- c(pvalues, pvalues_col)\n",
    "                    if(min(pvalues_col)<0.05){\n",
    "                        clin_erich <- clin_erich+1\n",
    "                    }\n",
    "                }else if(j == \"gender\" || j == \"history_of_neoadjuvant_treatment\"){\n",
    "                    pvalues_col <- apply(factors, MARGIN=2, \n",
    "                                       function(x) wilcox.test(x~ordered.clinical.data[,j])$p.value)\n",
    "                    pvalues <- c(pvalues, pvalues_col)\n",
    "                    if(min(pvalues_col)<0.05){\n",
    "                        clin_erich <- clin_erich+1\n",
    "                    }\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "        \n",
    "        # Number of clinical annotations with at least one significant p-value\n",
    "        line3 <- rbind(line3, clin_erich)\n",
    "\n",
    "        # Total number of significant factors in all tested columns\n",
    "        column <- names(pvalues)[pvalues<0.05]\n",
    "        # Number of unique factors that were significant at least once\n",
    "        line2<-rbind(line2, length(unique(column)))\n",
    "\n",
    "        # Number of times a p-value was found significant\n",
    "        signif <- length(column)\n",
    "        f<-length(unique(column))                                 \n",
    "        # Selectivity \n",
    "        if(signif!=0){\n",
    "            line <- rbind(line,((clin_erich/signif)+(f/signif))/2)\n",
    "        }else{\n",
    "            line <- rbind(line,0)\n",
    "        }\n",
    "\n",
    "    }\n",
    "    \n",
    "    # Store and return results\n",
    "    out <- data.frame(selectivity=line, nonZeroFacs=line2, total_pathways=line3)\n",
    "#    print(out)\n",
    "    return(out)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison based on survival predictions 基于生存预测的比较\n",
    "\n",
    "We here define the function that compares the performances of the various jDR methods based on their association to survival (computed through Cox regression).\n",
    "我们在这里定义了一个函数,它根据各种jDR方法与生存率的关系(通过Cox回归计算)来比较它们的性能。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "library('survival')\n",
    "\n",
    "## Perform survival annotation-based comparison \n",
    "## INPUTS:\n",
    "# factorizations = already computed factirizations\n",
    "# method = methods used for factorization\n",
    "# survival = survival data associated to the cancer\n",
    "# out.folder = folder where results will be written\n",
    "# cancer = name of currently analysed cancer\n",
    "## OUPUTS: a list containing output values\n",
    "\n",
    "##执行基于注释的比较\n",
    "##输入:\n",
    "#factorizations=已计算的因式分解\n",
    "#method=用于因子分解的方法\n",
    "#survival=与癌症相关的生存数据\n",
    "# out.folder文件夹=将写入结果的文件夹\n",
    "#cancer=当前分析癌症的名称\n",
    "##输出:包含输出值的列表\n",
    "survival_comparison <- function(factorizations, method, survival, out.folder, cancer){\n",
    "    \n",
    "    # Initialize result containers\n",
    "    factors_cancer <- numeric(0)\n",
    "    surv_final <- numeric(0)\n",
    "    \n",
    "    # Adjust sample names for breast survival dataset\n",
    "    if(cancer==\"breast\"){survival[,1] <- paste0(survival[,1],\"-01\")}\n",
    "    \n",
    "    # For each computed factorisation\n",
    "    for(i in 1:length(factorizations)){\n",
    "\n",
    "        # Extract sample factors \n",
    "        factors <- factorizations[[i]][[1]]\n",
    "\n",
    "        # Patient names in factorisation results\n",
    "        patient.names <- rownames(factors)\n",
    "        # Patient names in original data\n",
    "        patient.names.in.file <- as.character(survival[, 1])\n",
    "        patient.names.in.file <- toupper(gsub('-', '\\\\.', patient.names.in.file))\n",
    "        # Remove non-matching patient names\n",
    "        is_in_file <- patient.names %in% patient.names.in.file\n",
    "        if(length(patient.names)!=sum(is_in_file)) {\n",
    "            factors <- factors[is_in_file, ]\n",
    "            patient.names <- patient.names[is_in_file]\n",
    "            rownames(factors)<-patient.names\n",
    "        }\n",
    "        # Match indices of patient names\n",
    "        indices <- match(patient.names, patient.names.in.file)\n",
    "        # Use indices to extract coresponding survival information\n",
    "        ordered.survival.data <- survival[indices,]\n",
    "        # Clean data (assign 0 to NAs)\n",
    "        ordered.survival.data$Survival[is.na(ordered.survival.data$Survival)] <- 0\n",
    "        ordered.survival.data$Death[is.na(ordered.survival.data$Death)] <- 0\n",
    "\n",
    "        # Calculate coxph\n",
    "        coxph_obj <- coxph(Surv(ordered.survival.data$Survival, ordered.survival.data$Death) ~ factors)\n",
    "        # P-values (corrected by the number of methods)\n",
    "        pvalues <- length(factorizations)*as.matrix(coef(summary(coxph_obj))[,5])\n",
    "\n",
    "        # How many significant? \n",
    "        factors_cancer <- c(factors_cancer, sum(pvalues<0.05))\n",
    "        # Store p-values\n",
    "        surv_final <- cbind(surv_final, pvalues) \n",
    "    }\n",
    "    # Keep -log10 of p-values\n",
    "    surv_final<-(-log10(surv_final))\n",
    "    # Plot survival pvalues for each cancer type separately\n",
    "    png(file=paste0(out.folder, \"survival_\", cancer, \".png\"), width = 15, height = 15, units = 'in', res = 200)\n",
    "    matplot(1:length(method), t(surv_final), \n",
    "            col=\"black\", pch=18, xlab=\"Method\", ylab=\"Pvalues survival\", xaxt=\"none\", cex=1.5)\n",
    "    abline(h = (-log10(0.05)), v=0, col=\"black\", lty=3, lwd=3)\n",
    "    axis(1, at=1:length(method), labels=colnames(surv_final)) \n",
    "    dev.off()\n",
    "\n",
    "#    print(factors_cancer)\n",
    "    return(factors_cancer)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison based on association to biological annotations (REACTOME, Hallmarks, GO)基于与生物注释的关联的比较(反应组、标志、GO)\n",
    "We define here a function that performs the jDR comparison based on their association to biological annotations. In particular, we here compute the number of factors significantly associated with at least a biological annotation and their selectivity (see Methods paper). 在这里定义一个函数,该函数根据它们与生物注释的关联来执行jDR比较。特别是,我们在这里计算了与至少一个生物注释及其选择性显著相关的因子的数量(见方法论文)。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(\"fgsea\", quietly = TRUE)\n",
    "\n",
    "## Perform biological annotation-based comparison \n",
    "## INPUTS:\n",
    "# factorizations = already computed factirizations\n",
    "# path.database = path to a GMT annotation file\n",
    "# pval.thr = p-value threshold (default to 0.05)\n",
    "## OUPUTS: a list containing output values\n",
    "# selectivity = Selectivity (fraction of significant annotations per all significant factors)\n",
    "# nonZeroFacs = Number of unique factors that were significant at least once\n",
    "# total_pathways = Number of clinical annotations with at least one significant factor component\n",
    "##执行基于生物注释的比较\n",
    "##输入:\n",
    "#factorizations=已计算的因式分解\n",
    "# path.database =GMT注释文件的路径\n",
    "# pval.thr =p值阈值(默认为0.05)\n",
    "##OUPUTS:包含输出值的列表\n",
    "#selectivity=选择性(每个重要因素的重要注释分数)\n",
    "#nonZeroFacs=至少一次显著的唯一因素的数量\n",
    "#total_pathways=至少有一个显著因子成分的临床注释数\n",
    "biological_comparison <- function(factorizations, path.database, pval.thr=0.05){\n",
    "    \n",
    "    # Load annotation database\n",
    "    pathways <- gmtPathways(path.database)\n",
    "    \n",
    "    # Containers to report results\n",
    "    report_number <- numeric(0)\n",
    "    report_nnzero <- numeric(0)\n",
    "    report_select <- numeric(0)\n",
    "    \n",
    "    # For each factorization method\n",
    "    for(i in 1:length(factorizations)){\n",
    "        \n",
    "        # Extract metagenes found by factorization method\n",
    "        metagenes<-factorizations[[i]][[2]][[1]]\n",
    "        # Number of factors\n",
    "        num.factors <- ncol(metagenes)\n",
    "        # Rename columns\n",
    "        colnames(metagenes)<-1:num.factors\n",
    "        # Rename rows to remove \"|\" characters and keep only the gene name before\n",
    "        rownames(metagenes)<-gsub(\"\\\\|\",\".\",rownames(metagenes))\n",
    "        rownames(metagenes)<-gsub(\"\\\\..*\",\"\",rownames(metagenes))\n",
    "        # Remove duplicated gene names that could confuse fgsea\n",
    "        duplicated_names <- unique(rownames(metagenes)[duplicated(rownames(metagenes))])\n",
    "        metagenes <- metagenes[!(rownames(metagenes) %in% duplicated_names), ]\n",
    "\n",
    "        # Variables\n",
    "        min_pval <- numeric(0)\n",
    "        path <- numeric(0)\n",
    "        n <- 0\n",
    "        \n",
    "        # Calculate biological annotation enrichment.\n",
    "        # For each factor,\n",
    "        for(j in 1:num.factors){\n",
    "            # Assign gene names\n",
    "            rnk <- setNames(as.matrix(metagenes[,j]), rownames(metagenes))\n",
    "            # Compute fgsea\n",
    "            fgseaRes <- fgsea(pathways, rnk, minSize=15, maxSize=500, nperm=1000)\n",
    "            # If at least one pathway is significant\n",
    "            if(sum(fgseaRes$padj < pval.thr)!=0){\n",
    "                # Count this factor\n",
    "                n <- n+1\n",
    "                # Keep min adjusted p-value\n",
    "                min_pval <- rbind(min_pval, min(fgseaRes$padj))\n",
    "                # Keep names of significant pathways\n",
    "                path <- c(path, fgseaRes[fgseaRes$padj<pval.thr, \"pathway\"])\n",
    "            } else {\n",
    "                min_pval <- rbind(min_pval, NA)\n",
    "            }\n",
    "        }\n",
    "\n",
    "        # Report number of unique significant pathways  \n",
    "        if(length(path)==0){\n",
    "            report_number <- rbind(report_number, 0)\n",
    "        }else{\n",
    "            report_number <- rbind(report_number, length(unique(path)))\n",
    "        }\n",
    "        # Report selectivity \n",
    "        if(length(unique(path))==0){\n",
    "            report_select <- rbind(report_select, NA)\n",
    "        }else{\n",
    "            al<-length(unique(path))/length(path)\n",
    "            fl<-length(which(!is.na(min_pval)))/length(path)\n",
    "            report_select <- rbind(report_select, (al+fl)/2)\n",
    "        }\n",
    "        # Report number of factors associated with at least one significant pathway\n",
    "        report_nnzero<-rbind(report_nnzero, n)    \n",
    "    \n",
    "    }\n",
    "    \n",
    "    out <- data.frame(selectivity=report_select, nonZeroFacs=report_nnzero, total_pathways=report_number)\n",
    "#    print(out)\n",
    "    return(out)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running the comparisons in cancer 对癌症进行比较\n",
    "\n",
    "We here run the three functions defined above to perform the comparison on the cancer data. The cancer data should be organized into the data folder, each of them having the name of a different cancer type and containin the various omics data (3 omics for our test cases). Using the script povided in our github this step is automatically performed.\n",
    "我们在这里运行上面定义的三个函数来对癌症数据执行比较。癌症数据应该被组织到数据文件夹中,每个数据文件夹都有不同癌症类型的名称,并且包含在不同的组学数据中(我们的测试案例有3个组学)。使用github中提供的脚本,这一步将自动执行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading required package: MASS\n",
      "\n",
      "Loading required package: NMF\n",
      "\n",
      "Loading required package: pkgmaker\n",
      "\n",
      "Loading required package: registry\n",
      "\n",
      "Loading required package: rngtools\n",
      "\n",
      "Loading required package: cluster\n",
      "\n",
      "NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 27/28\n",
      "\n",
      "  To enable shared memory capabilities, try: install.extras('\n",
      "NMF\n",
      "')\n",
      "\n",
      "Loading required package: mclust\n",
      "\n",
      "Package 'mclust' version 5.4.6\n",
      "Type 'citation(\"mclust\")' for citing this R package in publications.\n",
      "\n",
      "Loading required package: InterSIM\n",
      "\n",
      "Loading required package: tools\n",
      "\n",
      "Loading required package: ade4\n",
      "\n",
      "\n",
      "Attaching package: ‘ade4’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:BiocGenerics’:\n",
      "\n",
      "    score\n",
      "\n",
      "\n",
      "\n",
      "Attaching package: ‘GPArotation’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:NMF’:\n",
      "\n",
      "    entropy\n",
      "\n",
      "\n",
      "\n",
      "Attaching package: ‘MOFAtools’\n",
      "\n",
      "\n",
      "The following objects are masked from ‘package:NMF’:\n",
      "\n",
      "    featureNames, featureNames<-, predict, sampleNames, sampleNames<-\n",
      "\n",
      "\n",
      "The following objects are masked from ‘package:Biobase’:\n",
      "\n",
      "    featureNames, featureNames<-, sampleNames, sampleNames<-\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:fgsea’:\n",
      "\n",
      "    plotEnrichment\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    predict\n",
      "\n",
      "\n",
      "Loading required package: JADE\n",
      "\n",
      "Loading required package: lattice\n",
      "\n",
      "Loading required package: caTools\n",
      "\n",
      "Loading required package: gdata\n",
      "\n",
      "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.\n",
      "\n",
      "\n",
      "\n",
      "gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.\n",
      "\n",
      "\n",
      "Attaching package: ‘gdata’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:Biobase’:\n",
      "\n",
      "    combine\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:BiocGenerics’:\n",
      "\n",
      "    combine\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    nobs\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:utils’:\n",
      "\n",
      "    object.size\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:base’:\n",
      "\n",
      "    startsWith\n",
      "\n",
      "\n",
      "Loading required package: gtools\n",
      "\n",
      "\n",
      "Attaching package: ‘gtools’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:InterSIM’:\n",
      "\n",
      "    logit\n",
      "\n",
      "\n",
      "Loading required package: gplots\n",
      "\n",
      "\n",
      "Attaching package: ‘gplots’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    lowess\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Load the function running the factorization, plus a support function\n",
    "source(\"runfactorization.R\")\n",
    "source(\"log2matrix.R\")\n",
    "\n",
    "# List downloaded cancer data.\n",
    "# Folder structure should be organized as discussed above.\n",
    "# Exclude first result as it's the parent folder\n",
    "cancers <- list.dirs(path = \"../data/cancer\", full.names = TRUE, recursive = TRUE)[-1]\n",
    "cancer_names <- list.dirs(path = \"../data/cancer\", full.names = FALSE, recursive = TRUE)[-1]\n",
    "\n",
    "# Annotation databases used for biological enrichment\n",
    "path.database <- \"../data/bio_annotations/c2.cp.reactome.v6.2.symbols.gmt\" #REACTOME\n",
    "#path.database <- \"../data/bio_annotations/h.all.v6.2.symbols.gmt\" #Hallmarks\n",
    "#path.database <- \"../data/bio_annotations/c5.all.v6.2.symbols.gmt\" #GO\n",
    "\n",
    "# Label to identify current run\n",
    "tag <- format(Sys.time(), \"%Y%m%d%H%M%S\")\n",
    "# Folder for comparison results\n",
    "results_folder <- paste0(\"../results\", tag, \"/\")\n",
    "# Create output folder\n",
    "dir.create(results_folder, showWarnings = FALSE)\n",
    "\n",
    "# Number of factors used in the paper\n",
    "num.factors <- 10\n",
    "\n",
    "# Initialize result containers\n",
    "clinical_analysis <- data.frame(\n",
    "    matrix(data = NA, ncol=5, nrow=0, \n",
    "           dimnames = list(c(), c(\"methods\", \"cancer\", \"selectivity\", \"nonZeroFacs\", \"total_pathways\"))\n",
    "          ),\n",
    "    stringsAsFactors = FALSE)\n",
    "\n",
    "biological_analysis <- data.frame(\n",
    "    matrix(data = NA, ncol=5, nrow=0, \n",
    "           dimnames = list(c(), c(\"methods\", \"cancer\", \"selectivity\", \"nonZeroFacs\", \"total_pathways\"))\n",
    "          ),\n",
    "    stringsAsFactors = FALSE)\n",
    "cancer.list <- list()\n",
    "\n",
    "# Clinical categories to be used for clinical tests\n",
    "col <- c(\"age_at_initial_pathologic_diagnosis\",\n",
    "         \"gender\",\n",
    "         \"days_to_new_tumor_event_after_initial_treatment\",\n",
    "         \"history_of_neoadjuvant_treatment\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"../data/cancer/aml\"      \"../data/cancer/breast\"  \n",
      "[3] \"../data/cancer/colon\"    \"../data/cancer/kidney\"  \n",
      "[5] \"../data/cancer/liver\"    \"../data/cancer/lung\"    \n",
      "[7] \"../data/cancer/melanoma\" \"../data/cancer/ovarian\" \n",
      "[9] \"../data/cancer/sarcoma\" \n"
     ]
    }
   ],
   "source": [
    "print(cancers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Now analysing ../data/cancer/aml\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:1234\n",
      "[1] \"Running survival analysis...\"\n",
      "[1] \"Running clinical analysis...\"\n",
      "[1] \"Running biological analysis...\"\n",
      "[1] \"Now analysing ../data/cancer/breast\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:123456\n",
      "[1] \"Running survival analysis...\"\n",
      "[1] \"Running clinical analysis...\"\n",
      "[1] \"Running biological analysis...\"\n",
      "[1] \"Now analysing ../data/cancer/colon\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:123456\n",
      "[1] \"Running survival analysis...\"\n",
      "[1] \"Running clinical analysis...\"\n",
      "[1] \"Running biological analysis...\"\n",
      "[1] \"Now analysing ../data/cancer/kidney\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:12345678910\n",
      "[1] \"Running survival analysis...\"\n",
      "[1] \"Running clinical analysis...\"\n",
      "[1] \"Running biological analysis...\"\n",
      "[1] \"Now analysing ../data/cancer/liver\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:123"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message:\n",
      "“did not converge in 10 iterations”\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "456\n",
      "[1] \"Running survival analysis...\"\n",
      "[1] \"Running clinical analysis...\"\n",
      "[1] \"Running biological analysis...\"\n",
      "[1] \"Now analysing ../data/cancer/lung\"\n",
      "[1] \"Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:12345\n",
      "[1] \"Running survival analysis...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in max(event[who2]):\n",
      "“max里所有的参数都不存在;回覆-Inf”\n"
     ]
    },
    {
     "ename": "ERROR",
     "evalue": "Error in coxph(Surv(ordered.survival.data$Survival, ordered.survival.data$Death) ~ : No (non-missing) observations\n",
     "output_type": "error",
     "traceback": [
      "Error in coxph(Surv(ordered.survival.data$Survival, ordered.survival.data$Death) ~ : No (non-missing) observations\nTraceback:\n",
      "1. survival_comparison(out$factorizations, out$method, survival, \n .     results_folder, current_cancer)",
      "2. coxph(Surv(ordered.survival.data$Survival, ordered.survival.data$Death) ~ \n .     factors)   # at line 56 of file <text>",
      "3. stop(\"No (non-missing) observations\")"
     ]
    }
   ],
   "source": [
    "# For each cancer dataset\n",
    "for(i in cancers){\n",
    "\n",
    "    print(paste0(\"Now analysing \", i))\n",
    "    \n",
    "    # Name of current cancer\n",
    "    current_cancer <- basename(i)\n",
    "\n",
    "    # If the expression and miRNA data are not log2-transformed as for those provided by XX et al.\n",
    "    log2matrix(i,\"exp\")\n",
    "    log2matrix(i,\"mirna\")\n",
    "\n",
    "    # Perform factorisation\n",
    "    print(\"Running factorisation...\")\n",
    "    out <- runfactorization(i, c(\"log_exp\",\"methy\",\"log_mirna\"), num.factors, sep=\" \", filtering=\"sd\")\n",
    "    \n",
    "    # Survival analysis\n",
    "    print(\"Running survival analysis...\")\n",
    "    survival <- read.table(paste0(i, \"/survival\"), sep=\"\\t\", header=TRUE, stringsAsFactors=FALSE)\n",
    "    out_survival <- survival_comparison(out$factorizations, out$method, survival, \n",
    "                                        results_folder, current_cancer)\n",
    "    \n",
    "    # Clinical analysis\n",
    "    print(\"Running clinical analysis...\")\n",
    "    clinical <- read.table(paste0(\"../data/clinical/\", current_cancer), sep=\"\\t\", header=TRUE)\n",
    "    out_clinical <- clinical_comparison(out$factorizations, clinical, col)   \n",
    "    clinical_analysis <- rbind(clinical_analysis,\n",
    "                                data.frame(methods=out$method, cancer=current_cancer, out_clinical))    \n",
    "\n",
    "    # Biological analysis\n",
    "    print(\"Running biological analysis...\")\n",
    "    out_bio <- biological_comparison(out$factorizations, path.database, pval.thr=0.05)\n",
    "    biological_analysis <- rbind(biological_analysis,\n",
    "                                data.frame(methods=out$method, cancer=current_cancer, out_bio))    \n",
    "}\n",
    "rownames(clinical_analysis) <- c()\n",
    "rownames(biological_analysis) <- c()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving results to file\n",
    "Here the obtained results are saved to file in the Results folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export results into separated tables\n",
    "write.table(biological_analysis, paste0(results_folder, \"results_biological_analysis.txt\"), \n",
    "            sep=\"\\t\", row.names=FALSE)\n",
    "\n",
    "write.table(clinical_analysis, paste0(results_folder, \"results_clinical_analysis.txt\"), \n",
    "            sep=\"\\t\", row.names=FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plots of the obtained results with respect to biological annotations\n",
    "Plots are then created accordingly to Figure 4 and Supp Figure 2 of the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#tiff(\"BioEnr_go_fig1.tiff\", units=\"in\", width=8.95, height=6.05, res=300)\n",
    "min_nonZero = min(biological_analysis[, \"nonZeroFacs\"]) \n",
    "max_nonZero = max(biological_analysis[, \"nonZeroFacs\"]) \n",
    "g <- ggplot(biological_analysis, \n",
    "            aes(x=nonZeroFacs,y=selectivity)) + \n",
    "    geom_point(aes(colour = methods, shape = cancer), size=5, alpha=.6, position=position_jitter(h=0, w=0.15))+ \n",
    "    theme_bw() + \n",
    "    scale_shape_manual(values=c(15,17,16)) + \n",
    "    scale_color_manual(values=c('#FF00FF', '#FF6E28', '#C8961E', '#FF0000', '#0000FF', '#A0A0A0', '#48D1CC', '#00FF00')) +\n",
    "    ylim(floor(min((biological_analysis[,\"selectivity\"]*10)-.4)) / 10,\n",
    "         ceiling(max((biological_analysis[,\"selectivity\"]*10)+.2)) / 10) +\n",
    "    labs(title=\"Biological annotations\", \n",
    "         x=\"# metagenes (factors) enriched in at least one annotation\") +\n",
    "    theme(plot.title = element_text(size=14,face=\"bold\"),\n",
    "          axis.text = element_text(size=11),\n",
    "          axis.title = element_text(size=13),\n",
    "          legend.text=element_text(size=10)) +\n",
    "    ylab(\"Selectivity\") + \n",
    "    labs(colour = \"Methods\", shape = \"Cancer\") +\n",
    "    guides(color = guide_legend(order = 1),shape = guide_legend(order = 2),size = guide_legend(order = 3)) + \n",
    "    #scale_x_discrete() + scale_x_discrete(limits=min_nonZero:max_nonZero) + \n",
    "    scale_x_discrete(limits=min_nonZero:max_nonZero, labels = c(min_nonZero:max_nonZero));\n",
    "g\n",
    "#dev.off()\n",
    "\n",
    "ggsave(paste0(results_folder, \"biological_annotations.pdf\"),dpi=300)\n",
    "ggsave(paste0(results_folder, \"biological_annotations.png\"),dpi=300) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plots of the obtained results with respect to clinical annotations\n",
    "Plots are then created accordingly to Figure 4 and Supp Figure 2 of the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#tiff(\"BioEnr_go_fig1.tiff\", units=\"in\", width=8.95, height=6.05, res=300)\n",
    "min_nonZero = min(clinical_analysis[, \"nonZeroFacs\"]) \n",
    "max_nonZero = max(clinical_analysis[, \"nonZeroFacs\"]) \n",
    "g <- ggplot(clinical_analysis, \n",
    "            aes(x=nonZeroFacs, y=selectivity)) + \n",
    "    geom_point(aes(colour = methods, shape = cancer), size=5, alpha=.6, position=position_jitter(h=0, w=0.15))+ \n",
    "    theme_bw() + \n",
    "    scale_shape_manual(values=c(15,17,16)) + \n",
    "    scale_color_manual(values=c('#FF00FF', '#FF6E28', '#C8961E', '#FF0000', '#0000FF', '#A0A0A0', '#48D1CC', '#00FF00')) +\n",
    "    ylim(floor(min((clinical_analysis[,\"selectivity\"]*10)-.4)) / 10,\n",
    "         ceiling(max((clinical_analysis[,\"selectivity\"]*10)+.2)) / 10) +\n",
    "    labs(title=\"Clinical annotations\", \n",
    "         x=\"# metagenes (factors) enriched in at least one annotation\") +\n",
    "    theme(plot.title = element_text(size=14,face=\"bold\"),\n",
    "          axis.text = element_text(size=11),\n",
    "          axis.title = element_text(size=13),\n",
    "          legend.text=element_text(size=10)) +\n",
    "    ylab(\"Selectivity\") + \n",
    "    labs(colour = \"Methods\",shape=\"Cancer\") +\n",
    "    guides(color = guide_legend(order = 1),shape = guide_legend(order = 2),size = guide_legend(order = 3)) + \n",
    "    #scale_x_discrete() + scale_x_discrete(limits=min_nonZero:max_nonZero) + \n",
    "    scale_x_discrete(limits=min_nonZero:max_nonZero, labels = c(min_nonZero:max_nonZero));\n",
    "g\n",
    "#dev.off()\n",
    "\n",
    "ggsave(paste0(results_folder, \"clinical_annotations.pdf\"),dpi=300)\n",
    "ggsave(paste0(results_folder, \"clinical_annotations.png\"),dpi=300) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}