[3bfed4]: / vignettes / Introduction_to_INDEED.Rmd

Download this file

131 lines (93 with data), 10.0 kB

---
title: "INDEED R package for cancer biomarker discovery"
author: "Yiming Zuo and Kian Ghaffari"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{INDEED R package for cancer biomarker discovery}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Introduction
Differential expression (DE) analysis is commonly used to identify biomarker candidates that have significant changes in their expression levels between distinct biological groups. One drawback of DE analysis is that it only considers the changes on single biomolecular level. In differential network (DN) analysis, network is typically built based on the correlation and biomarker candidates are selected by investigating the network topology. However, correlation tends to generate over-complicated networks and the selection of biomarker candidates purely based on network topology ignores the changes on single biomolecule level. Thus, we have proposed a novel method INDEED, which considers both the changes on single biomolecular and network levels by integrating DE and DN analysis. INDEED has been published in Methods journal ([PMID: 27592383](https://www.ncbi.nlm.nih.gov/pubmed/?term=27592383%5Buid%5D)). This is the R package that implements the algorithm.

This R package will generate a list of dataframes containing information such as p-values, node degree and activity score for each biomolecule. A higher activity score indicates that the corresponding biomolecule has more neighbors connected in the differential network and their p-values are more statistically significant. It will also generate a network display to aid users' biomarker selection.


## Installation

You can install INDEED from github with:

```{r, eval = F}
# install.packages("devtools")
devtools::install_github("ressomlab/INDEED")
```


## Load package

Load the package.

```{r, eval = F}
# load INDEED
library(INDEED)
```


## Testing dataset

A testing dataset has been provided to the users to get familiar with INDEED R package. It contains the expression levels of 39 metabolites from 120 subjects (CIRR: 60; HCC: 60) with CIRR group named as group 0 and HCC group named as group 1.

```{r dataset, eval = F}
# Data matrix contains the expression levels of 39 metabolites from 120 subjects 
# (6 metabolites and 10 subjects are shown)
head(Met_GU[, 1:10])
# Group label for each subject (40 subjects are shown)
Met_Group_GU[1:40]
# Metabolite KEGG IDs (10 metabolites are shown)
Met_name_GU[1:10]
```


## non-partial correlation data analysis function `non_partial_cor()`

- (**data**) This is a p*n dataframe that contains the expression levels for all biomolecules and samples.
- (**class_label**) This is a 1*n dataframe that contains the class label with 0 for group 1 and 1 for group 2.
- (**id**) This is a p*1 dataframe that contains the ID for each biomolecule.
- (**method**) This is a character string indicating which correlation method is to use. The options are either "pearson" as the default or "spearman".
- (**p_val**) This is optional. It is a p*1 dataframe that contains the p-value for each biomolecule from DE analysis.
- (**permutation**) This is a positive integer representing the desired number of permutations. The default is 1000.
- (**permutation_thres**) This is a integer representing the threshold for the permutation test. The default is 0.05 to achieve 95 percent confidence.
- (**fdr**) This is a boolean value indicating whether to apply multiple testing correction (TRUE) or not (FALSE). The default is FALSE. However, if users find the output network is too sparse even after relaxing the permutation_thres, it's probably a good idea to turn off the multiple testing correction.

In non partial correlation method, users only need to run `non_partial_cor()` function. Result will be saved in a list of two dataframes: activity_score and diff_network. activity_score dataframe contains biomolecules ranked by activity score calculated from p-value and node degree. diff_network dataframe contains binary and weight connections for network display.

The following example demonstrates how to use `non_partial_cor()` function:

```{r, eval = F}
result <- non_partial_cor(data = Met_GU, class_label = Met_Group_GU, id = Met_name_GU, method = "pearson", 
                          p_val = pvalue_M_GU, permutation = 1000, permutation_thres = 0.05, fdr = TRUE)
```


## partial correlation data preprocessing function `select_rho_partial()`

- (**data**) This is a p*n dataframe that contains the expression levels for all biomolecules and samples.
- (**class_label**) This is a 1*n dataframe that contains the class label with 0 for group 1 and 1 for group 2.
- (**id**) This is a p*1 dataframe that contains the ID for each biomolecule.
- (**error_curve**) This is a boolean value indicating whether to plot the error curve (TRUE) or not (FALSE). The default is TRUE.


## partial correlation data analysis function `partial_cor()`
- (**data_list**) This is a list of pre-processed data outputted by the select_rho_partial function.
- (**rho_group1**) This is a character string indicating the rule for choosing rho value for group 1, "min": minimum rho, "ste": one standard error from minimum, or user can input rho of their choice. The default is minimum.
- (**rho_group2**) This is a character string indicating the rule for choosing rho value for group 2, "min": minimum rho, "ste": one standard error from minimum, or user can input rho of their choice, the default is minimum.
- (**p_val**) This is optional. It is a p*1 dataframe that contains the p-value for each biomolecule from DE analysis.
- (**permutation**) This is a positive integer representing the desired number of permutations. The default is 1000.
- (**permutation_thres**) permutation_thres This is a integer representing the threshold for the permutation test. The default is 0.05 to achieve 95 percent confidence.
- (**fdr**) This is a boolean value indicating whether to apply multiple testing correction (TRUE) or not (FALSE). The default is TRUE. However, if users find the output network is too sparse even after relaxing the permutation_thres, it's probably a good idea to turn off the multiple testing correction.

In partial correlation method, users will need to preprocess the data using `select_rho_partial()` function, and then apply `partial_cor()` function to complete the analysis. Users can provide a p-value table from their DE analysis in `partial_cor()` function. Result will be saved in a list of two dataframes: activity_score and diff_network. activity_score dataframe contains biomolecules ranked by activity score calculated from p-value and node degree. diff_network dataframe contains binary and weight connections for network display.

The following example demonstrates how to use `select_rho_partial()` and `partial_cor()`function:

```{r, eval = F}
pre_data <- select_rho_partial(data = Met_GU, class_label = Met_Group_GU, id = Met_name_GU,
                               error_curve = TRUE)
result <- partial_cor(data_list = pre_data, rho_group1 = 'min', rho_group2 = "min", p_val = pvalue_M_GU,
                      permutation = 1000, permutation_thres = 0.05, fdr = TRUE)
```

In this example, the sparse differential network is based on partial correlation. p-value for each biomolecule is provided from users. rho is selected based on minimum rule. The number of permutations is set to 1000. The threshold is 0.05. Multiple testing correction is applied.


## Interactive Network Visualization function `network_display()`

- (**result**) This is the result from calling either non_partial_corr() or partial_corr(). 
- (**nodesize**) This parameter determines what the size of each node will represent. The options are 'Node_Degree', 'Activity_Score','P_Value' and 'Z_Score'. The title of the resulting network will identify which parameter is selected to represent the node size. The default is Node_Degree.
- (**nodecolor**) This parameter determines what color each node will be based on a yellow to blue color gradient. The options are 'Node_Degree', 'Activity_Score', 'P_Value', and 'Z_Score'. A color bar will be created based on which parameter is chosen. The default is Activity_Score.
- (**edgewidth**) This is a boolean value to indicate whether the edgewidth should be representative of the weight connection (TRUE) or not (FALSE). The default is FALSE.
- (**layout**) Users can choose from a a handful of network visualization templates including: 'nice', 'sphere', 'grid', 'star', and 'circle'. The default is nice.

This is an interactive function to assist in the visualization of the result from INDEED functions non_partial_corr() or patial_corr(). The size and the color of each node can be adjusted by users to represent either the Node_Degree, Activity_Score, Z_Score, or P_Value. The color of the edge is based on the binary value of either 1 corresponding to a positive correlation depicted as green or a negative correlation of -1 depicted as red. Users also have the option of having the width of each edge be proportional to its weight value. The layout of the network can also be customized by choosing from the options: 'nice', 'sphere', 'grid', 'star', and 'circle'. Nodes can be moved and zoomed in on. Each node and edge will display extra information when clicked on. Secondary interactions will be highlighted as well when a node is clicked on. 

The following example demonstrates how to use the `network_display()` function:

```{r, eval = F}
result <- non_partial_cor(data = Met_GU, class_label = Met_Group_GU, id = Met_name_GU, method = "pearson",
                          p_val = pvalue_M_GU, permutation = 1000, permutation_thres = 0.05, fdr = FALSE)
network_display(result = result, nodesize= 'Node_Degree', nodecolor= 'Activity_Score', 
                edgewidth= FALSE, layout= 'nice')
```