[73f552]: / vignettes / netOmics.Rmd

Download this file

686 lines (527 with data), 24.1 kB

---
title: "netOmics"
author: 
- name:  "Antoine Bodein"
  affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada"
  email: "antoine.bodein.1@ulaval.ca"
- name:  "Marie-Pier Scott-Boyer"
  affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada"
- name: "Olivier Perin"
  affiliation: "Digital Sciences Department, L’Oréal Advanced Research, Aulnay-sous-bois, France"
- name: "Kim-Anh Lê Cao"
  affiliation: "Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia"
- name: "Arnaud Droit"
  affiliation: "CHU de Québec Research Center, Université Laval, Molecular Medicine department, Québec, QC, Canada"
package: netOmics  
output: 
  BiocStyle::html_document
  
vignette: >
  %\VignetteIndexEntry{netOmics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  

bibliography: ["mybib.bib"]
biblio-style: apalike
link-citations: true
---

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.align = "center")
```


The emergence of multi-omics data enabled the development of 
integration methods.

With netOmics, we go beyond integration by introducing an interpretation tool.
netOmics is a package for the creation and exploration of multi-omics networks.

Depending on the provided dataset, it allows to create inference networks from 
expression data but also interaction networks from knowledge databases.
After merging the sub-networks to obtain a global multi-omics network, 
we propose network exploration methods using propoagation techniques to perform 
functional prediction or identification of molecular mechanisms.

Furthermore, the package has been developed for longitudinal multi-omics data 
and can be used in conjunction with our previously published package timeOmics.

![Overview](./img/netomics_overview.png)

For more informnation about the method, please check [@bodein2020interpretation]

In this vignette, we introduced a case study which depict the main steps to 
create and explore multi-omics networks from multi-omics time-course data.

# Requirements

```{r,eval=FALSE}
# install the package via BioConductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("netOmics")
```

```{r,eval=FALSE}
# install the package via github
library(devtools)
install_github("abodein/netOmics")
```


```{r, eval=TRUE, message=FALSE}
# load the package
library(netOmics)
```


```{r, eval=TRUE, message=FALSE}
# usefull packages to build this vignette
library(timeOmics)
library(tidyverse)
library(igraph)
```

# Case Study: Human Microbiome Project T2D

The package will be illustrated on longitudinal MO dataset to study 
the seasonality of MO expression in patients with diabetes [@sailani2020deep].

The data used in this vignette is a subset of the data available at: 
https://github.com/aametwally/ipop_seasonal

We focused on a single individual with 7 timepoints.
6 different omics were sampled 
(RNA, proteins, cytokines, gut microbiome, metabolites and clinical variables).

```{r load_data}
# load data
data("hmp_T2D")
```


# (optional: *timeOmics* analysis)

The first step of the analysis is the preprocessing and longitudinal clustering.
This step is carried out with timeOmics and should be reserved for longitudinal 
data.

It ensures that the time profiles are classified into groups of similar profiles
so each MO molecule is labbeled with its cluster.

In addition, timeOmics can identify a multi-omics signature of the clusters.
These molecules can be, for example, the starting points of the propogation 
analysis.

For more informations about *timeOmics*, please 
check http://www.bioconductor.org/packages/release/bioc/html/timeOmics.html

As illustrated in the R chunk below the timeOmics step includes:

* omic-specific preprocessing and longitudinal fold-change filtering
* modelling of expression profiles
* clustering of MO expression profiles
* signature identification by cluster

```{r timeOmics_1, eval=FALSE}
# not evaluated in this vignette

#1 filter fold-change
remove.low.cv <- function(X, cutoff = 0.5){
    # var.coef
    cv <- unlist(lapply(as.data.frame(X), 
                    function(x) abs(sd(x, na.rm = TRUE)/mean(x, na.rm= TRUE))))
    return(X[,cv > cutoff])
}
fc.threshold <- list("RNA"= 1.5, "CLINICAL"=0.2, "GUT"=1.5, "METAB"=1.5,
                     "PROT" = 1.5, "CYTO" = 1)

# --> hmp_T2D$raw
data.filter <- imap(raw, ~{remove.low.cv(.x, cutoff = fc.threshold[[.y]])})

#2 scale
data <- lapply(data.filter, function(x) log(x+1))
# --> hmp_T2D$data


#3 modelling
lmms.func <- function(X){
    time <- rownames(X) %>% str_split("_") %>% 
      map_chr(~.x[[2]]) %>% as.numeric()
    lmms.output <- lmms::lmmSpline(data = X, time = time,
                                   sampleID = rownames(X), deri = FALSE,
                                   basis = "p-spline", numCores = 4, 
                                   keepModels = TRUE)
    return(lmms.output)
}
data.modelled <- lapply(data, function(x) lmms.func(x))

# 4 clustering
block.res <- block.pls(data.modelled, indY = 1, ncomp = 1)
getCluster.res <- getCluster(block.res)
# --> hmp_T2D$getCluster.res


# 5 signature
list.keepX <- list("CLINICAL" = 4, "CYTO" = 3, "GUT" = 10, "METAB" = 3, 
                   "PROT" = 2,"RNA" = 34)
sparse.block.res  <- block.spls(data.modelled, indY = 1, ncomp = 1, scale =TRUE, 
                                keepX =list.keepX)
getCluster.sparse.res <- getCluster(sparse.block.res)
# --> hmp_T2D$getCluster.sparse.res
```

timeOmics resulted in 2 clusters, labelled `1` and `-1`

```{r timeOmics_2}
# clustering results
cluster.info <- hmp_T2D$getCluster.res
```

# Network building

Each layer of the network is built sequentially and then assembled 
in a second section.

All the functions in the package can be used on one element or a list of 
elements.
In the longitudinal context of the data, kinetic cluster sub-networks are built 
plus a global network
(`1`, `-1` and `All`).

## Inference Network

Multi-omics network building starts with a first layer of gene.
Currently, the ARACNe algorithm handles the inference but we will include more 
algorithms in the future.

The function `get_grn` return a Gene Regulatory Network from gene expression 
data. 
Optionally, the user can provide a timeOmics clustering result (`?getCluster`) 
to get cluster specific sub-networks. In this case study, this will 
automatically build the networks (`1`, `-1` and `All`), as indicated previously.

The `get_graph_stats` function provides basic graph statistics such as the 
number of vertices and edges.
If the vertices have different attributes, it also includes a summary of these.

```{r graph.rna, warning=FALSE}
cluster.info.RNA <- timeOmics::getCluster(cluster.info, user.block = "RNA")
graph.rna <- get_grn(X = hmp_T2D$data$RNA, cluster = cluster.info.RNA)

# to get info about the network
get_graph_stats(graph.rna)
```

## Interaction from databases

As for the genes, the second layer is a protein layer (Protein-Protein 
Interaction).
This time, no inference is performed. Instead, known interactions are extracted
from a database of interaction (BIOGRID).

The function `get_interaction_from_database` will fetch the interactions from a 
database provided as a `data.frame` (with columns `from` and `to`) or a graph 
(`igraph` object).
In addition to the interactions between the indicated molecules, the first 
degree neighbors can also be collected (option `user.ego = TRUE`)


```{r PROT_graph, warning=FALSE}
# Utility function to get the molecules by cluster
get_list_mol_cluster <- function(cluster.info, user.block){
  require(timeOmics)
    tmp <- timeOmics::getCluster(cluster.info, user.block) 
    res <- tmp %>% split(.$cluster) %>% 
        lapply(function(x) x$molecule)
    res[["All"]] <- tmp$molecule
    return(res)
}

cluster.info.prot <- get_list_mol_cluster(cluster.info, user.block = 'PROT')
graph.prot <-  get_interaction_from_database(X = cluster.info.prot, 
                                             db = hmp_T2D$interaction.biogrid, 
                                             type = "PROT", user.ego = TRUE)
# get_graph_stats(graph.prot)
```

In this example, only a subset of the Biogrid database is used 
(matching elements).

## Other inference methods

Another way to compute networks from expression data is to use other inference
methods.
In the following chunk, we intend to illustrate the use of the SparCC algorithm
[@friedman2012inferring] on the gut data and how it can be integrate into the 
pipeline. 
(sparcc is not included in this package)


```{r GUT_graph, eval = FALSE}
# not evaluated in this vignette
library(SpiecEasi)

get_sparcc_graph <- function(X, threshold = 0.3){
    res.sparcc <- sparcc(data = X)
    sparcc.graph <- abs(res.sparcc$Cor) >= threshold
    colnames(sparcc.graph) <-  colnames(X)
    rownames(sparcc.graph) <-  colnames(X)
    res.graph <- graph_from_adjacency_matrix(sparcc.graph, 
                                             mode = "undirected") %>% simplify
    return(res.graph)
}

gut_list <- get_list_mol_cluster(cluster.info, user.block = 'GUT')

graph.gut <- list()
graph.gut[["All"]] <- get_sparcc_graph(hmp_T2D$raw$GUT, threshold = 0.3)
graph.gut[["1"]] <- get_sparcc_graph(hmp_T2D$raw$GUT %>% 
                                       dplyr::select(gut_list[["1"]]), 
                                     threshold = 0.3)
graph.gut[["-1"]] <- get_sparcc_graph(hmp_T2D$raw$GUT %>% 
                                        dplyr::select(gut_list[["-1"]]), 
                                      threshold = 0.3)
class(graph.gut) <- "list.igraph"
```

```{r GUT}
graph.gut <- hmp_T2D$graph.gut
# get_graph_stats(graph.gut)
```

## Other examples

For this case study, we complete this first step of network building with the 
missing layers.

```{r CYTO_graph, warning=FALSE}
# CYTO -> from database (biogrid)
cyto_list = get_list_mol_cluster(cluster.info = cluster.info, 
                                 user.block = "CYTO")
graph.cyto <-  get_interaction_from_database(X = cyto_list,
                                             db = hmp_T2D$interaction.biogrid, 
                                             type = "CYTO", user.ego = TRUE)
# get_graph_stats(graph.cyto)

# METAB -> inference
cluster.info.metab <-  timeOmics::getCluster(X = cluster.info, 
                                             user.block = "METAB")
graph.metab <-  get_grn(X = hmp_T2D$data$METAB, 
                        cluster = cluster.info.metab)
# get_graph_stats(graph.metab)

# CLINICAL -> inference
cluster.info.clinical <- timeOmics::getCluster(X = cluster.info, 
                                               user.block = 'CLINICAL')
graph.clinical <- get_grn(X = hmp_T2D$data$CLINICAL,
                          cluster = cluster.info.clinical)
# get_graph_stats(graph.clinical)
```

# Layer merging

We included 2 types of layer merging: 

* *merging with interactions* uses the shared elements between 2 graphs to build
a larger network.
* *merging with correlations* uses the spearman correlation from expression 
profiles between 2 layers when any interaction is known.


## Merging with interactions

The function `combine_layers` enables the fusion of different network layers.
It combines the network (or list of network) in `graph1` with the network
(or list of network) in `graph2`, based on the shared vertices between
the networks.

Additionally, the user can provide an interaction table `interaction.df`
(in the form of a data.frame or igraph object).

In the following chunk, we sequentially merge RNA, PROT and CYTO layers and uses
the TFome information (TF protein -> Target Gene) to connect these layers.

```{r, merged_0}
full.graph <- combine_layers(graph1 = graph.rna, graph2 = graph.prot)
full.graph <- combine_layers(graph1 = full.graph, graph2 = graph.cyto)

full.graph <- combine_layers(graph1 = full.graph,
                             graph2 = hmp_T2D$interaction.TF)
# get_graph_stats(full.graph)
```

## Merging with correlations

To connect omics layers for which no interaction information is available,
we propose to use a threshold on the correlation between the expression profiles
of two or more omics data.

The strategy is as follows: we isolate the omics from the data and calculate
the correlations between this omics and the other data.

```{r merged_1_gut, warning=FALSE}
all_data <- reduce(hmp_T2D$data, cbind)

# omic = gut
gut_list <- get_list_mol_cluster(cluster.info, user.block = "GUT")
omic_data <- lapply(gut_list, function(x)dplyr::select(hmp_T2D$data$GUT, x))

# other data = "RNA", "PROT", "CYTO"
other_data_list <- get_list_mol_cluster(cluster.info,
                                        user.block = c("RNA", "PROT", "CYTO"))
other_data <- lapply(other_data_list, function(x)dplyr::select(all_data, x))

# get interaction between gut data and other data
interaction_df_gut <- get_interaction_from_correlation(X = omic_data,
                                                       Y = other_data,
                                                       threshold = 0.99)

# and merge with full graph
full.graph <- combine_layers(graph1 = full.graph,
                             graph2 = hmp_T2D$graph.gut,
                             interaction.df = interaction_df_gut$All)
```


```{r, merged_2_clinical, warning=FALSE}
# omic =  Clinical
clinical_list <- get_list_mol_cluster(cluster.info, user.block = "CLINICAL")
omic_data <- lapply(clinical_list, 
                    function(x)dplyr::select(hmp_T2D$data$CLINICAL, x))

# other data = "RNA", "PROT", "CYTO", "GUT"
other_data_list <- get_list_mol_cluster(cluster.info,
                                        user.block = c("RNA", "PROT", 
                                                       "CYTO", "GUT"))
other_data <- lapply(other_data_list, function(x)dplyr::select(all_data, x))


# get interaction between gut data and other data
interaction_df_clinical <- get_interaction_from_correlation(X = omic_data
                                                            , Y = other_data,
                                                            threshold = 0.99)

# and merge with full graph
full.graph <- combine_layers(graph1 = full.graph,
                             graph2 = hmp_T2D$graph.clinical, 
                             interaction.df = interaction_df_clinical$All)
```


```{r, merged_3_metab, warning=FALSE}
# omic =  Metab
metab_list <- get_list_mol_cluster(cluster.info, user.block = "METAB")
omic_data <- lapply(metab_list, function(x)dplyr::select(hmp_T2D$data$METAB, x))

# other data = "RNA", "PROT", "CYTO", "GUT", "CLINICAL"
other_data_list <- get_list_mol_cluster(cluster.info,
                                        user.block = c("RNA", "PROT", "CYTO", 
                                                       "GUT", "CLINICAL"))
other_data <- lapply(other_data_list, function(x)dplyr::select(all_data, x))

# get interaction between gut data and other data
interaction_df_metab <- get_interaction_from_correlation(X = omic_data,
                                                         Y = other_data, 
                                                         threshold = 0.99)

# and merge with full graph
full.graph <- combine_layers(graph1 = full.graph, 
                             graph2 = graph.metab, 
                             interaction.df = interaction_df_metab$All)
```

# Addition of supplemental layers

For the interpretation of the MO integration results, the use of additional 
information layers or molecules can be useful to enrich the network.

## Over Representation Analysis

ORA is a common step to include knowledge.
The function `get_interaction_from_ORA` perform the ORA analysis from the 
desired molecules and return an interaction graph with the enriched terms and 
the corresponding molecules.

Then, the interaction graph with the new vertices can be linked to the network 
as illustrated in the previous step.

Here, ORA was performed with RNA, PROT, and CYTO against the Gene Ontology.

```{r}
# ORA by cluster/All
mol_ora <- get_list_mol_cluster(cluster.info, 
                                user.block = c("RNA", "PROT", "CYTO"))

# get ORA interaction graph by cluster
graph.go <- get_interaction_from_ORA(query = mol_ora,
                                     sources = "GO",
                                     organism = "hsapiens",
                                     signif.value = TRUE)

# merge
full.graph <- combine_layers(graph1 = full.graph, graph2 = graph.go)
```

## External knowledge

Additionally, knowledge from external sources can be included in the network.

In the following chunk, we performed disease-related gene enrichment analysis
from *medlineRanker* (http://cbdm-01.zdv.uni-mainz.de/~jfontain/cms/?page_id=4).
We converted the results into a data.frame (with the columns `from` and `to`)
and this acted as an interaction database.

```{r}
# medlineRanker -> database
medlineranker.res.df <- hmp_T2D$medlineranker.res.df %>% 
  dplyr::select(Disease, symbol) %>% 
  set_names(c("from", "to"))
  
mol_list <-  get_list_mol_cluster(cluster.info = cluster.info,
                                  user.block = c("RNA", "PROT", "CYTO"))
graph.medlineranker <-  get_interaction_from_database(X = mol_list,
                                                      db = medlineranker.res.df, 
                                                      type = "Disease",
                                                      user.ego = TRUE)
# get_graph_stats(graph.medlineranker)

# merging
full.graph <- combine_layers(graph1 = full.graph, graph2 = graph.medlineranker)
```

We complete the MO network preparation with attribute cleaning and addition of 
several attributes such as:

* mode = "core" if the vertex was originally present in the data; "extended"
otherwise
* sparse = TRUE if the vertex was present in kinetic cluster signature; FALSE
otherwise
* type = type of omics ("RNA","PROT","CLINICAL","CYTO","GUT","METAB","GO",
"Disease")
* cluster = '1', '-1' or 'NA' (for vertices not originally present in the
original data)

```{r}
# graph cleaning
graph_cleaning <- function(X, cluster.info){
    # no reusability
    X <- igraph::simplify(X)
    va <- vertex_attr(X)
    viewed_mol <- c()
    for(omic in unique(cluster.info$block)){
        mol <- intersect(cluster.info %>% dplyr::filter(.$block == omic) %>%
                           pull(molecule), V(X)$name)
        viewed_mol <- c(viewed_mol, mol)
        X <- set_vertex_attr(graph = X, 
                             name = "type", 
                             index = mol, 
                             value = omic)
        X <- set_vertex_attr(graph = X, 
                             name = "mode",
                             index = mol,
                             value = "core")
    }
    # add medline ranker and go
    mol <- intersect(map(graph.go, ~ as_data_frame(.x)$to) %>%
                       unlist %>% unique(), V(X)$name) # only GO terms
    viewed_mol <- c(viewed_mol, mol)
    X <- set_vertex_attr(graph = X, name = "type", index = mol, value = "GO")
    X <- set_vertex_attr(graph = X, name = "mode", 
                         index = mol, value = "extended")
    
    mol <- intersect(as.character(medlineranker.res.df$from), V(X)$name)
    viewed_mol <- c(viewed_mol, mol)
    X <- set_vertex_attr(graph = X, name = "type",
                         index = mol, value = "Disease")
    X <- set_vertex_attr(graph = X, name = "mode",
                         index = mol, value = "extended")
    
    other_mol <- setdiff(V(X), viewed_mol)
    if(!is_empty(other_mol)){
        X <- set_vertex_attr(graph = X, name = "mode",
                             index = other_mol, value = "extended")
    }
    X <- set_vertex_attr(graph = X, name = "mode", 
                         index = intersect(cluster.info$molecule, V(X)$name), 
                         value = "core")
    
    # signature
    mol <-  intersect(V(X)$name, hmp_T2D$getCluster.sparse.res$molecule)
    X <- set_vertex_attr(graph = X, name = "sparse", index = mol, value = TRUE)
    mol <-  setdiff(V(X)$name, hmp_T2D$getCluster.sparse.res$molecule)
    X <- set_vertex_attr(graph = X, name = "sparse", index = mol, value = FALSE)
    
    return(X)
}
```


```{r}
FULL <- lapply(full.graph, function(x) graph_cleaning(x, cluster.info))
get_graph_stats(FULL)
```

# Network exploration

## Basics network exploration

We can use basic graph statistics to explore the network such as degree 
distribution, modularity, and short path.

```{r, eval = FALSE}
# degree analysis
d <- degree(FULL$All)
hist(d)
d[max(d)]

# modularity # Warnings: can take several minutes
res.mod <- walktrap.community(FULL$All)
# ...

# modularity
sp <- shortest.paths(FULL$All)
```

## Random walk with restart

RWR is a powerful tool to explore the MO networks which simulates a particle 
that randomly walk on the network.
From a starting point (`seed`) it ranks the other vertices based on their
proximity with the seed and the network structure.

We use RWR for function prediction and molecular mechanism identification.

In the example below, the seeds were the GO terms vertices.

```{r}
# seeds = all vertices -> takes 5 minutes to run on regular computer
# seeds <- V(FULL$All)$name
# rwr_res <- random_walk_restart(FULL, seeds)

# seed = some GO terms
seeds <- head(V(FULL$All)$name[V(FULL$All)$type == "GO"])
rwr_res <- random_walk_restart(FULL, seeds)
```

### Find vertices with specific attributes

After the RWR analysis, we implemented several functions to extract valuable
information.

To identify MO molecular functions, the seed can be a GO term and we are 
interested to identify vertices with different omics type within the 
closest nodes.

The function `rwr_find_seeds_between_attributes` can identify which seeds were 
able to reach vertices with different attributes (ex: `type`) within the 
closest `k` (ex: `15`) vertices.

The function `summary_plot_rwr_attributes` displays the number of different 
values for a seed attribute as a bar graph.

```{r}
rwr_type_k15 <- rwr_find_seeds_between_attributes(X = rwr_res, 
                                                  attribute = "type", k = 15)

# a summary plot function
summary_plot_rwr_attributes(rwr_type_k15)
summary_plot_rwr_attributes(rwr_type_k15$All)
```

Alternatively, we can be interested to find functions or molecules which 
link different kinetic cluster (to find regulatory mechanisms).

```{r}
rwr_type_k15 <- rwr_find_seeds_between_attributes(X = rwr_res$All, 
                                                  attribute = "cluster", k = 15)
summary_plot_rwr_attributes(rwr_type_k15)
```

A RWR subnetworks can also be displayed with `plot_rwr_subnetwork` 
from a specific seed.
```{r}
sub_res <- rwr_type_k15$`GO:0005737`
sub <- plot_rwr_subnetwork(sub_res, legend = TRUE, plot = TRUE)
```

### Function prediction

Finally, RWR can also be used for function prediction.
From an annotated genes, the predicted function can be the closest vertex of the
type "GO".

We generalized this principle to identify, from a seed of interest, the closest
node (or `top` closest nodes) with specific attributes and value.

In the example below, the gene "ZNF263" is linked to the 5 closest nodes of 
type = 'GO' and type = 'Disease'.

```{r}
rwr_res <- random_walk_restart(FULL$All, seed = "ZNF263")

# closest GO term
rwr_find_closest_type(rwr_res, seed = "ZNF263", attribute = "type", 
                      value = "GO", top = 5)

# closest Disease
rwr_find_closest_type(rwr_res, seed = "ZNF263", attribute = "type", 
                      value = "Disease", top = 5)

# closest nodes with an attribute "cluster" and the value "-1"
rwr_find_closest_type(rwr_res, seed = "ZNF263", attribute = "cluster",
                      value = "-1", top = 5)
```


```{r, eval = FALSE}
seeds <- V(FULL$All)$name[V(FULL$All)$type %in% c("GO", "Disease")]
```

```{r}
sessionInfo()
```

# References