[b33e61]: / R-scripts / scripts / Comparison_in_simulated_data.ipynb

Download this file

755 lines (754 with data), 28.1 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SUB-BENCHMARK 1: Comparing jDRs performances on simulated data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We here reproduce the first sub-benchmark present in the paper. The performances of the 9 multi-omics jDRs are thus compared based on their clustering performances on simulated datasets.\n",
    "\n",
    "## Construction of the simulated datasets\n",
    "\n",
    "We first define the function that produces the simulated datasets. The function simulates 3-omics datasets (mRNA expression, methylation and protein quantification), using the CRAN InserSIM package. Each dataset is composed of 100 samples. We considered different scenarios. First, 'equally spaced' clusters (i.e., composed of the same number of samples) and heterogeneous clusters (i.e., composed of a variable number of samples) can be generated. Second, different numbers of clusters are imposed on the data: 5,10,15. \n",
    "\n",
    "我们首先定义产生模拟数据集的函数。 该功能使用CRAN InserSIM软件包模拟3-omics数据集(mRNA表达,甲基化和蛋白质定量)。 每个数据集由100个样本组成。 我们考虑了不同的情况。 首先,可以产生“等距”的簇(即,由相同数量的样本组成)和异构的簇(即,由可变数量的样本组成)。 其次,对数据施加不同数量的聚类:5、10、15。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 27/28\n",
      "\n",
      "  To enable shared memory capabilities, try: install.extras('\n",
      "NMF\n",
      "')\n",
      "\n",
      "Loading required package: mclust\n",
      "\n",
      "Package 'mclust' version 5.4.6\n",
      "Type 'citation(\"mclust\")' for citing this R package in publications.\n",
      "\n",
      "Loading required package: ade4\n",
      "\n",
      "\n",
      "Attaching package: ‘ade4’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:BiocGenerics’:\n",
      "\n",
      "    score\n",
      "\n",
      "\n",
      "\n",
      "Attaching package: ‘GPArotation’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:NMF’:\n",
      "\n",
      "    entropy\n",
      "\n",
      "\n",
      "\n",
      "Attaching package: ‘MOFAtools’\n",
      "\n",
      "\n",
      "The following objects are masked from ‘package:NMF’:\n",
      "\n",
      "    featureNames, featureNames<-, predict, sampleNames, sampleNames<-\n",
      "\n",
      "\n",
      "The following objects are masked from ‘package:Biobase’:\n",
      "\n",
      "    featureNames, featureNames<-, sampleNames, sampleNames<-\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    predict\n",
      "\n",
      "\n",
      "Loading required package: JADE\n",
      "\n",
      "Loading required package: lattice\n",
      "\n",
      "Loading required package: caTools\n",
      "\n",
      "Loading required package: gdata\n",
      "\n",
      "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.\n",
      "\n",
      "\n",
      "\n",
      "gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.\n",
      "\n",
      "\n",
      "Attaching package: ‘gdata’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:Biobase’:\n",
      "\n",
      "    combine\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:BiocGenerics’:\n",
      "\n",
      "    combine\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    nobs\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:utils’:\n",
      "\n",
      "    object.size\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:base’:\n",
      "\n",
      "    startsWith\n",
      "\n",
      "\n",
      "Loading required package: gtools\n",
      "\n",
      "\n",
      "Attaching package: ‘gtools’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:InterSIM’:\n",
      "\n",
      "    logit\n",
      "\n",
      "\n",
      "Loading required package: gplots\n",
      "\n",
      "\n",
      "Attaching package: ‘gplots’\n",
      "\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    lowess\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "library(\"InterSIM\", quietly = TRUE)\n",
    "source(\"runfactorization.R\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Base folder for data\n",
    "data_folder <- \"../data/\"\n",
    "# Label to identify current run\n",
    "tag <- format(Sys.time(), \"%Y%m%d%H%M%S\")\n",
    "# Folder containing simulated data\n",
    "simul_folder <- paste0(data_folder, \"simulations_\", tag, \"/\") \n",
    "# Folder for comparison results\n",
    "results_folder <- paste0(\"../results\", tag, \"/\")\n",
    "\n",
    "dir.create(data_folder, showWarnings = FALSE)\n",
    "dir.create(simul_folder, showWarnings = FALSE)\n",
    "dir.create(results_folder, showWarnings = FALSE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Simulate data\n",
    "## INPUTS:\n",
    "# folder = location  where the simulated data should be saved\n",
    "# num.clusters = number of clusters to be imposed on the data\n",
    "# size = heterogeneous for heterogeneous clusters, equal for equally-sized clusters\n",
    "## OUPUTS: matrices of simulated data are saved to file in folder\n",
    "##模拟数据\n",
    "##输入:\n",
    "#folder=应保存模拟数据的位置\n",
    "#num.clusters =要施加在数据上的集群数\n",
    "#size =对于异构集群,heterogeneous,对于相等大小的集群,equal\n",
    "## OUPUTS:模拟数据矩阵保存到文件夹中的文件中\n",
    "simulated_data_generation <- function(out.folder, num.clusters, size=\"heterogeneous\", predefined=TRUE) {\n",
    "    \n",
    "    # Number of clusters\n",
    "    num.clusters <- as.numeric(num.clusters)\n",
    "    # Size of the effect\n",
    "    effect <- 2.5\n",
    "    # Sample proportions per clusters defined here are those used for the paper\n",
    "    #此处定义的每类样本比例是本文使用的比例\n",
    "    prop_predefined <- list(\n",
    "        \"heterogeneous\" = list(\n",
    "            \"5\" = c(0.35, 0.13, 0.19, 0.08, 0.25),\n",
    "            \"10\" = c(0.20, 0.10, 0.07, 0.10, 0.15, 0.13, 0.10, 0.08, 0.05, 0.02),\n",
    "            \"15\" = c(0.10,0.08,0.04,0.03,0.12,0.03,0.10,0.03,0.05,0.02,0.1,0.2,0.03,0.02,0.05)\n",
    "        ),\n",
    "        \"equal\" = list(\n",
    "            \"5\" = c(0.25,0.2,0.2,0.2,0.15),\n",
    "            \"10\" = c(0.15,0.1,0.1,0.1,0.1,0.1,0.05,0.1,0.1,0.1),\n",
    "            \"15\" = c(0.07,0.07,0.07,0.06,0.07,0.07,0.07,0.06,0.07,0.06,0.07,0.06,0.07,0.06,0.07)\n",
    "        )\n",
    "    )\n",
    "\n",
    "    # Check provided parameter (size) against allowed values\n",
    "    if(! size %in% names(prop_predefined)) {\n",
    "        print(paste0(\"ERROR: size can only assume value : \", \n",
    "                     paste0(names(prop_predefined), collapse=\",\"),\n",
    "                     \" found : \", size))\n",
    "    }\n",
    "\n",
    "    # If article proportions are to be used\n",
    "    if(predefined) {\n",
    "        # Check provided parameter (number of clusters) against allowed values\n",
    "        if(! as.character(num.clusters) %in% names(prop_predefined[[size]])) {\n",
    "            print(paste0(\"ERROR: num.clusters can only assume value : \", \n",
    "                         paste0(names(prop_predefined[[size]]), collapse=\",\"),\n",
    "                         \" found : \",\n",
    "                         num.clusters))\n",
    "        }\n",
    "        prop <- prop_predefined[[size]][[as.character(num.clusters)]]\n",
    "        prop[1] <- 1-sum(prop[-1])\n",
    "    }\n",
    "    # Otherwise\n",
    "    else {\n",
    "        if(size == \"equal\") {\n",
    "            # Could be simplified! Only necessary because InterSIM is \"easily offended\" :\n",
    "            # ensure same object type as in the heterogeneous case, and that not all \n",
    "            # values are exactly the same (should not impact the number of samples per group)\n",
    "            # - same type\n",
    "            equals <- rep(1, num.clusters)\n",
    "            prop <- equals/sum(equals)\n",
    "            # - slightly imbalanced\n",
    "            delta <- 0.05*prop[1]\n",
    "            prop[1] <- prop[1]+delta\n",
    "            prop[num.clusters] <- prop[num.clusters]-delta\n",
    "            # - sum is 1\n",
    "            prop <- round(prop, digits = 10)\n",
    "            prop[1] <- 1-sum(prop[-1])\n",
    "        }\n",
    "        else {\n",
    "            random <- runif(n = num.clusters, min = 0, max = 1)\n",
    "            prop <- random/sum(random)\n",
    "        }\n",
    "    }\n",
    "\n",
    "    # Simulate data based on provided parameters\n",
    "    print(prop)\n",
    "    print(sum(prop))\n",
    "    print(sum(prop)==1)\n",
    "    sim.D <- InterSIM(n.sample=100, cluster.sample.prop=prop, \n",
    "                      delta.methyl=effect, delta.expr=effect, delta.protein=effect, \n",
    "                      p.DMP=0.25, p.DEG=NULL, p.DEP=NULL,\n",
    "                      do.plot=FALSE, sample.cluster=TRUE, feature.cluster=TRUE)\n",
    "\n",
    "    # Export simulations as tables\n",
    "    write.table(sim.D$clustering.assignment, paste(out.folder, \"clusters.txt\", sep=\"/\"), sep=\"\\t\")\n",
    "    write_table_with_index_header(t(sim.D$dat.methyl), paste(out.folder, \"omics1.txt\", sep=\"/\"))\n",
    "    write_table_with_index_header(t(sim.D$dat.expr), paste(out.folder, \"omics2.txt\", sep=\"/\"))\n",
    "    write_table_with_index_header(t(sim.D$dat.protein), paste(out.folder, \"omics3.txt\", sep=\"/\"))\n",
    "\n",
    "    return(\"data saved in folder\")\n",
    "}\n",
    "\n",
    "## Support function\n",
    "write_table_with_index_header <- function(data, file, sep=\"\\t\") {\n",
    "    write.table(cbind(probe=row.names(data),data), file, sep = sep, \n",
    "                append = FALSE, quote = FALSE, row.names = FALSE, col.names = TRUE)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison based on clustering assignment\n",
    "\n",
    "We here define the function to compare the clustering performances of the 9 jDRs. The function computes the Jaccard Index of the clusters predicted from the different jDR vs. the ground-truth clusters imposed during simulations. It is to note that, when necessary, the cluster assignment is computed from the factors by kmeans clustering (see Methods of the paper for further details)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Compare factorizations obtained through different methods\n",
    "## INPUTS: \n",
    "# factorizations = list of factorization results\n",
    "# methods = list of ran factorization methods\n",
    "# in.folder = path to input folder\n",
    "# out.folder = path to output folder\n",
    "# icluster.clusters = clustering result of iCluster\n",
    "# intNMF.clusters = clustering result of intNMF\n",
    "# number_cl = number of clusters\n",
    "## OUPUTS: matrix of Jaccard Indexes is saved to table in out.folder\n",
    "clusters_comparison <- function(factorizations, methods, in.folder, out.folder,\n",
    "                                icluster.clusters, intNMF.clusters, number_cl) {\n",
    "\n",
    "    ind <- 0\n",
    "    JI_final <- matrix(data=NA, nrow=number_cl, ncol=length(factorizations))\n",
    "\n",
    "    # Read clusters imposed on simulated data \n",
    "    cl  <- as.matrix(read.table(paste0(in.folder, \"clusters.txt\"), \n",
    "                                sep=\"\\t\", row.names=1, header=TRUE))\n",
    "    cl2 <- as.matrix(as.numeric(cl[,2]))\n",
    "    rownames(cl2) <- cl[,1]\n",
    "\n",
    "    # Assigning samples to cluster based on factors\n",
    "    for (i in 1:length(factorizations)) {\n",
    "        \n",
    "        if (methods[i]==\"iCluster\" || methods[i]==\"intNMF\") {\n",
    "            \n",
    "            if(methods[i]==\"iCluster\") {\n",
    "                clust_iCluster <- as.matrix(icluster.clusters)\n",
    "            } else {\n",
    "                clust_iCluster <- as.matrix(intNMF.clusters)\n",
    "            }\n",
    "            \n",
    "            # Creation sets of samples\n",
    "            JI_mat <- numeric(0)\n",
    "            for (p in 1:number_cl) {\n",
    "                x1 <- rownames(clust_iCluster)[which(clust_iCluster[,1]==p)]\n",
    "                row <- numeric(0)\n",
    "                for (j in 1:number_cl) {\n",
    "                    x2  <- rownames(cl2)[which(cl2[,1]==j)]\n",
    "                    I   <- length(intersect(x1,x2))\n",
    "                    S   <- I/(length(x1)+length(x2)-I)\n",
    "                    row <- cbind(row,S)\n",
    "                }\n",
    "                JI_mat <- rbind(JI_mat,row)\n",
    "            }\n",
    "            JI_final[1:number_cl,i]<-apply(JI_mat,1,max)\n",
    "            \n",
    "        } else {\n",
    "            \n",
    "            factors <- factorizations[[i]][[1]]\n",
    "            \n",
    "            # Clustering by Kmeans\n",
    "            JI_good <- numeric(0)\n",
    "            for (run in 1:1000) {\n",
    "                kmeans.out <- kmeans(factors, centers=number_cl)\n",
    "                kmeans.out\n",
    "                clust_iCluster <- as.matrix(kmeans.out$cluster)\n",
    "                ######creation sets of samples\n",
    "                JI_mat <- numeric(0)\n",
    "                for (p in 1:number_cl) {\n",
    "                    x1  <- rownames(clust_iCluster)[which(clust_iCluster[,1]==p)]\n",
    "                    row <- numeric(0)\n",
    "                    for(j in 1:number_cl) {\n",
    "                        x2  <- rownames(cl2)[which(cl2[,1]==j)]\n",
    "                        I   <- length(intersect(x1,x2))\n",
    "                        S   <- I/(length(x1)+length(x2)-I)\n",
    "                        row <- cbind(row,S)\n",
    "                     }\n",
    "                     JI_mat <- rbind(JI_mat,row)\n",
    "                 }\n",
    "                 JI_good <- rbind(JI_good,apply(JI_mat,1,max))\n",
    "            }\n",
    "            JI_final[1:number_cl,i] <- apply(JI_good,2,mean)\n",
    "        }\n",
    "    }\n",
    "    \n",
    "    # Export JI values to a file \n",
    "    write.table(JI_final, paste0(out.folder,\"/JI.txt\"), sep=\"\\t\", row.names=FALSE, col.names=methods)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running the comparisons\n",
    "\n",
    "We here run the jDR comparison. First the function to simulate the data, then the factorization through the external function run_factorization and then the function comparing the performances based on the Jaccard Index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: heterogeneous, Nb clusters: 5\"\n",
      "[1] \"-> Simulating data...\"\n",
      "[1] 0.35 0.13 0.19 0.08 0.25\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=6:123\n",
      "[1] \"-> Comparing results...\"\n",
      "[1] \"-> Done.\"\n",
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: equal, Nb clusters: 5\"\n",
      "[1] \"-> Simulating data...\"\n",
      "[1] 0.25 0.20 0.20 0.20 0.15\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=6:12\n",
      "[1] \"-> Comparing results...\"\n",
      "[1] \"-> Done.\"\n",
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: heterogeneous, Nb clusters: 10\"\n",
      "[1] \"-> Simulating data...\"\n",
      " [1] 0.20 0.10 0.07 0.10 0.15 0.13 0.10 0.08 0.05 0.02\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:12\n",
      "[1] \"-> Comparing results...\"\n",
      "[1] \"-> Done.\"\n",
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: equal, Nb clusters: 10\"\n",
      "[1] \"-> Simulating data...\"\n",
      " [1] 0.15 0.10 0.10 0.10 0.10 0.10 0.05 0.10 0.10 0.10\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=11:12\n",
      "[1] \"-> Comparing results...\"\n",
      "[1] \"-> Done.\"\n",
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: heterogeneous, Nb clusters: 15\"\n",
      "[1] \"-> Simulating data...\"\n",
      " [1] 0.10 0.08 0.04 0.03 0.12 0.03 0.10 0.03 0.05 0.02 0.10 0.20 0.03 0.02 0.05\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"No output file provided, using a temporary file...\"\n",
      "Generating warm start... \n",
      "K=16:12\n",
      "[1] \"-> Comparing results...\"\n",
      "[1] \"-> Done.\"\n",
      "[1] \"##########\"\n",
      "[1] \"-> Distribution: equal, Nb clusters: 15\"\n",
      "[1] \"-> Simulating data...\"\n",
      " [1] 0.07 0.07 0.07 0.06 0.07 0.07 0.07 0.06 0.07 0.06 0.07 0.06 0.07 0.06 0.07\n",
      "[1] 1\n",
      "[1] TRUE\n",
      "[1] \"-> Running factorisation...\"\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message in createMOFAobject(omics):\n",
      "“View names are not specified in the data, renaming them to: view_1, view_2, view_3\n",
      "”\n",
      "Creating MOFA object from list of matrices,\n",
      " please make sure that samples are columns and features are rows...\n",
      "\n",
      "\n",
      "Checking data options...\n",
      "\n",
      "Checking training options...\n",
      "\n",
      "Checking model options...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## Simulate data, factorize and compare the results\n",
    "\n",
    "list_clusters <- seq(5,15,5)\n",
    "list_distrib <-  c(\"heterogeneous\",\"equal\")\n",
    "\n",
    "# For a given number of clusters\n",
    "for(num.clusters in list_clusters) {\n",
    "    # Data distribution among clusters will either be heterogeneous or equal \n",
    "    for (size in list_distrib) {\n",
    "        \n",
    "        print(\"##########\")\n",
    "        print(paste0(\"-> Distribution: \", size, \", Nb clusters: \", num.clusters))\n",
    "        \n",
    "        # Make simulated data\n",
    "        print(\"-> Simulating data...\")\n",
    "        simulated_data_generation(simul_folder, num.clusters, size, predefined=TRUE)\n",
    "        # Run factorisation\n",
    "        print(\"-> Running factorisation...\")\n",
    "        out <- runfactorization(simul_folder, paste0(\"omics\",1:3,\".txt\"), num.clusters, sep=\"\\t\", filtering=\"none\")\n",
    "        # Folder for comparison \n",
    "        comp_folder <- paste0(results_folder, num.clusters, \"_\", size)\n",
    "        dir.create(comp_folder)\n",
    "        # Perform clustering and comparison of factorisations\n",
    "        print(\"-> Comparing results...\")\n",
    "        clusters_comparison(out$factorizations, out$method, \n",
    "                            simul_folder, comp_folder, \n",
    "                            out$icluster.clusters, out$intNMF.clusters, num.clusters)\n",
    "        \n",
    "        print(\"-> Done.\")\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating results and plots\n",
    "\n",
    "We here save to the Results folder the obtained Jaccard Indices and plot their values according to Figure 2 in the paper. The results that you will obtain will be slightly different from those of the paper, due to the stochasticity of the InterSIM package. The data that you will simulate will not have the same sample-cluster associations and the same features of those simulated in the paper. However, the jDRs having best performances remain the same despite such variations. For the clustering methods sometimes small variability in intNMF results is observed.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<strong>png:</strong> 2"
      ],
      "text/latex": [
       "\\textbf{png:} 2"
      ],
      "text/markdown": [
       "**png:** 2"
      ],
      "text/plain": [
       "png \n",
       "  2 "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Save all boxplots in a single PDF output file\n",
    "pdf(file=paste0(results_folder, \"simulated_boxplots.pdf\"), width = 15, height = 15, onefile = TRUE)\n",
    "\n",
    "# For each chosen number of clusters\n",
    "for (i in list_clusters) {\n",
    "    \n",
    "    # Output files for each distribution\n",
    "    eq_file <- paste0(results_folder, i, \"_equal/\", \"JI.txt\")\n",
    "    het_file <- paste0(results_folder, i, \"_heterogeneous/\", \"JI.txt\")\n",
    "    if(exists(\"JI.final\")) rm(JI.final)\n",
    "    \n",
    "    # Load clusters (equal distribution)\n",
    "    if(file.exists(eq_file)) {\n",
    "        JI.final  <- read.table(eq_file, sep=\"\\t\", header=TRUE)\n",
    "        names(JI.final) <- paste0(names(JI.final), \"_EQ\")        \n",
    "    }\n",
    "    cat\n",
    "    # Load clusters (heterogeneous distribution)\n",
    "    if(file.exists(het_file)) {\n",
    "        JI.het <- read.table(het_file, sep=\"\\t\", header=TRUE)\n",
    "        names_methods <- names(JI.het)\n",
    "        names(JI.het) <- paste0(names(JI.het), \"_HET\")\n",
    "\n",
    "        # Aggregate results\n",
    "        if(exists(\"JI.final\")) {\n",
    "            JI.final <- data.frame(JI.het, JI.final)\n",
    "            new_order <- apply(expand.grid(c(\"_HET\", \"_EQ\"), names_methods)[, c(2,1)], 1, paste, collapse=\"\")\n",
    "            JI.final <- JI.final[, new_order]\n",
    "        }\n",
    "        else {\n",
    "            JI.final <- JI.het\n",
    "        }\n",
    "    }\n",
    "    \n",
    "    # Plot results\n",
    "    par(mar=c(25,4,2,2)+.1)\n",
    "    boxplot(JI.final, xaxt=\"none\", cex.axis=3.5, \n",
    "                 col=c('gray','gray','red','red','blue','blue','blueviolet','blueviolet','deeppink','deeppink','chocolate1','chocolate1','darkgoldenrod','darkgoldenrod','green','green','darkturquoise','darkturquoise'), \n",
    "                 ann=FALSE, outline=FALSE)\n",
    "    matplot(1:ncol(JI.final), t(JI.final), col=\"black\", pch=16, xaxt=\"none\", cex=0.8, add=TRUE)\n",
    "    axis(1, at=1:ncol(JI.final), labels=names(JI.final), \n",
    "         las=2, srt=45, cex=0.8, cex.lab=3.5, cex.axis=3.5, cex.main=1.5, cex.sub=1.5) \n",
    "    title(main=paste(i,\"clusters\",sep=\" \"), \n",
    "          cex.lab=0.75, line = -2.5, adj=0, cex.main=3.5)\n",
    "}\n",
    "dev.off()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}