[9987e3]: / vignettes / chen_vignette.Rmd

Download this file

361 lines (304 with data), 14.2 kB

---
title: "Predict GRN from scRNA+scATAC data (Chen 2018 dataset)"
author: "Trimbour Rémi"
date: "2023-05-16"
output:
  html_document: default
  github_document: default
  pdf_document: default
resource_files:
  - figures/schema_HuMMuS.png
  - figures/hummus_object_description.png
  - figures/build_multilayer.png
  - figures/5_steps.png
  - figures/explore_multilayer.png
---


## Useful links:
*Paper*: https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1

*Github repo*: https://github.com/cantinilab/HuMMuS

*Documentation*: https://cantinilab.github.io/HuMMuS/

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
devtools::install_github("cantinilab/HuMMuS", ref="dev_SeuratV5")
```
# General description of the pipeline
![Overall pipeline](figures/schema_HuMMuS.png)

## Useful ressources
*Preprint detailing the method:* [https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1](https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1)

*Github repo detailing the installation:* [https://github.com/cantinilab/HuMMuS](https://github.com/cantinilab/HuMMuS/tree/)

*Documentation and vignette:* [https://cantinilab.github.io/HuMMuS](https://cantinilab.github.io/HuMMuS)


### General plan
##### 0. Preparation of the environment
##### 1. Initialisation of a hummus object 
##### 2. Construction of the multilayer
##### 3. Analyse multilayer and define gene regulatory network (GRN)
    


## 0. Setting up the environment
```{r import_packages}
# install python dependency
envname = "r-reticulate"
# reticulate::py_install("hummuspy", envname = envname, pip=TRUE)
reticulate::use_condaenv(envname)

hummuspy <- reticulate::import("hummuspy")
library(HuMMuS)
```
## Download the single-cell data
The data used in this tutorial can be [downloaded here](https://figshare.com/account/home#/projects/168899)


## 1. Initialisation of HuMMuS object
HuMMuS R objects are instances developed on top of seurat objects. It means it’s created from a seurat object and the contained assays can be accessed the same way.

Additionally, it contains a motifs_db object, providing tf motifs informations, and a multilayer objects, that will be completed while going through this tutorial. It will mostly include :

  - list of multiplex networks (one per modality) 
  - list of bipartites (one per connection between layers)
![hummus object schema](figures/hummus_object_description.png)


### 1.1. Transform data into a hummus object
```{r initiate_hummus_object}
# Load the Chen dataset, which is a Seurat object containing scRNA-seq and scATAC-seq data
data("chen_dataset_subset")
chen_dataset_subset

# Create an hummus object from seurat object
hummus <- Initiate_Hummus_Object(chen_dataset_subset)
```
  
### 1.2. Add genome and motif annotations to hummus object
Fetch genome annotations online (necessitate an internet connection).
You can also request any "EnsDB" object adapted to your data 
(e.g. EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 for human genome annotations)
 or use your own genome annotations in the same format.

```{r genome_annotations, eval=TRUE, warning=FALSE}
# get human genome annotation from EndDb data
# wrapper of Signac::GetGRangesFromEnsDb, adapting output to UCSC format
genome_annotations <- get_genome_annotations(
  ensdb_annotations = EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86)

# can also be downloaded, saved as an RDS objects for example
```

Add genome annotations to hummus/seurat object
```{r add_genome_annotations, eval=TRUE}
Signac::Annotation(hummus@assays$peaks) <- genome_annotations
rm(genome_annotations)
```

Get TF motifs from JASPAR2020 and chromVARmotifs databsases in a motifs_db
object. By default, human motifs are used. You can specify the species you want
to use with the `species` argument (e.g. species = "mouse" for mouse).
motifs_db objects contain 3 slots : 
* `motifs = "PWMatrixList"`
* `tf2motifs = "data.frame"`
* `tfs = "NULL"`
PWMatrixList is a named vector of the motif matrices, whil tf2motifs is a
correspondance table between TFs and motifs. tfs is a named vector of the TFs.
You can also use your own motifs_db object, as long as it contains the same
slots.

```{r get_tf2motifs, eval=TRUE}
# Load TF motifs from JASPAR2020 and chromVARmotifs in hummus object
hummus@motifs_db <- get_tf2motifs() # by default human motifs
```

## 2. Construction of the multilayer

![hummus object schema](figures/build_multilayer.png)

You can compute the different layers and bipartites as indicated below.

An example multilayer completed can also be imported with : `data(chen_subset_hummus)`.
This object corresponds to a multilayer from chen_dataset_subset completed. You can then go to the part 3, replacing `hummus` by `chen_subset_hummus` in each step.

Finally, [you can compute the different layers before, and add them afterwards](add_networks.Rmd).
It allows to use faster methods to compute the networks
(e.g. [Arboreto](https://arboreto.readthedocs.io/en/latest/) for the gene network,
 [ATACNet](https://github.com/r-trimbour/ATACNet) for the peak network, etc.).

### Compute 3 layers and 2 bipartites
![hummus object schema](figures/5_steps.png)

**!! Long step !!** You can also go directly to the part 3 for your "discovery tour". :)

### 2.1. TF - peaks bipartite reconstruction

TF - peaks bipartite is computed using the motifs_db object and the peak
assay. You can specify the assay to use to filter TFs (e.g. "RNA" if you want
to use only the TFs expressed in your dataset). If NULL, all TFs with motifs
will be used.
BSGenome object is used to identify location of motifs and intersect them with
peak 
<br>You can also specify the name of the bipartite that will be added to the
 hummus object. By default, it will be named "tf_peak".
```{r bipartite_tf_peak, eval=TRUE}
hummus <- bipartite_tfs2peaks(
              hummus_object = hummus,
              tf_expr_assay = "RNA", # use to filter TF on only expressed TFs,
                                     # if NULL, all TFs with motifs are used
              peak_assay = "peaks",
              tf_multiplex_name = "TF",
              genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
              )
```

### 2.2. Genes - peaks bipartite reconstruction
Peaks - genes bipartite is computed 
```{r bipartite_peaks2genes, eval=TRUE}
hummus <- bipartite_peaks2genes(
                      hummus_object = hummus,
                      gene_assay = "RNA",
                      peak_assay = "peaks",
                      store_network = FALSE,
                      )
```

#### Compute layer networks and add it to hummus object
Each one of the three layers is computed individually. 

### 2.3. Compute the TF network from OmniPath database
We currently use OmniPath R package to fetch TF interactions. 
You can first specify if you want to use only the TFs expressed in your dataset
(if you have a RNA assay in your hummus object). If `gene_assay` is NULL, all
TFs with motifs will be used.
<br> You can then specify which interactions you want to keep through
'source_target' argument ("AND" | "OR"). If "AND", only the interactions
between 2 TFs that are both present in the dataset will be kept. If "OR", all
interactions involving at least one TF present in the dataset will be kept.
<br>Finally, you can specify the name of the multiplex and the name of the
network that will be added to the hummus object.
The added network will be undirected and unweighted since PPI and OmniPath
database are not directional nor return any weight here.
```{r tf_network, eval=TRUE}
hummus <- compute_tf_network(hummus,
                            gene_assay = "RNA", # default = None ;
                                                # If a assay is provided,
                                                # only the TFs that are present
                                                # will be considered
                            verbose = 1,
                            #source_target = "OR",
                            multiplex_name = "TF",
                            tf_network_name = "TF_network")
```

### 2.4. Compute gene network from scRNA-seq w/ GENIE3
### *!! This step can be very slow if you have thousands of cells !!*
_Current recommendation if you have a big dataset is to compute the network_
_before with [GRNBoost2 thorugh arboreto](https://arboreto.readthedocs.io/en/latest/)_
_and [add it to the hummus object afterwards](https://cantinilab.github.io/HuMMuS/articles/add_networks.html)._
Different methods can be used to compute the gene network. For now, only GENIE3
is implemented in HuMMuS. You can specify which assay to use to compute the
network (`gene_assay`).
<br>You can specify the number of cores to use to compute
the network. You can also specify if you want to save the network locally
(`store_network = TRUE`) or not (`store_network = FALSE`). If you choose to save
the network, you will need to specify the output file name (`output_file`).
The returned network will be considered undirected and weighted. While GENIE3
returns a directed network, we symmetrize it for the random walk with restart
exploration of the genes proximity.

```{r gene_network, eval=TRUE}
hummus <- compute_gene_network(
              hummus,
              gene_assay = "RNA",
              method = "GENIE3",
              verbose = 1,
              number_cores = 5, # GENIE3 method can be ran
                                # parallelised on multiple cores
              store_network = FALSE, # by default : FALSE, but
                                     # each network can be saved 
                                     # when computed with hummus
              output_file = "gene_network.tsv")
```

### 2.5. Compute the peak network from scATAC-seq w/ Cicero
Different methods can be used to compute the peak network. For now, only Cicero
is implemented in HuMMuS. You can specify which assay to use to compute the network
(`peak_assay`). You can also specify the number of cores to use to compute the
network. You can also specify if you want to save the network locally
(`store_network = TRUE`) or not (`store_network = FALSE`). If you choose to save
the network, you will need to specify the output file name (`output_file`).
The returned network will be considered undirected and weighted, since cis-regulatory
interaction and Cicero outputs are not directional. 

```{r peak_network, eval=TRUE}
hummus <- compute_atac_peak_network(hummus,
              atac_assay = "peaks",
              verbose = 1,
              genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
              store_network = FALSE)  
```


## 3. Analyse of the multilayer and definition of GRN
```{r load_precomputed}
data(chen_subset_hummus) 
hummus <- chen_subset_hummus
```

### 3.1. Save the mulilayer in a classic hierarchical structure
The package used for the random walk with restart exploration (multixrank)
 requires currently to save all the network files on disk. To simplify the
 organisation of the file, it is possible to save everything necessary with
 function `save_multilayer()`.<br>
It will create a folder (specified through `folder_name`) containing all the
 files necessary to run the multixrank algorithm. The folder will contain
 the following subfolders :
  * **bipartite** : containing the bipartites files 
  * **multiplex** : containing the multiplex sub-subfolders
    * **multiplex_1** (e.g. TF|peak|RNA) : containing the network file
    of each layer of the multiplex
  * **seed** : that will contain the seed files (necessary to compute HuMMuS
    outputs later)
  * **config** : that will contain the config files (necessary to compute
    HuMMuS outputs later)
    
![hummus object schema](figures/explore_multilayer.png)
```{r save_multilayer}
save_multilayer(hummus = hummus,
                folder_name = "chen_multilayer")
```

### 3.2. Retrieve target genes
With HuMMuS, inference of GRN and target gene of TFs are different outputs.
Indeed, while GRN is computed making TFs compete to regulate genes (by random
walk with restart starting from the genes and going to the TFs), target genes
are computed making genes compete to be regulated by TFs (by random walk with
restart starting from the TFs and going to the genes).<br>
For target genes output, you can specify the list of TFs (`tf_list`) to use as seed
(if NULL by default, all TFs will be used as seed). Only the links between
the seed TFs and the genes will be computed. You can also specify the list of
genes to use. Only the score of the genes present in the network and the
`gene_list` will be returned.

```{r target_genes_ATF2, eval=TRUE}
ATF2_genes <- define_target_genes(
  hummus,
  tf_list = list("ATF2"),
  multilayer_f = "chen_multilayer",
  njobs = 1
  )
```
```{r head_target_genes_ATF2}
head(ATF2_genes)
```


```{r target_genes}
target_genes <- define_target_genes(
  hummus,
  multilayer_f = "chen_multilayer",
  njobs = 1
  )
```
```{r head_target_genes}
head(target_genes)
```

### 3.3. Define GRN
The GRN is defined using the multixrank algorithm. It requires to have
 the hummuspy python package installed (pip install hummuspy).<br>
 <br> This can be parallelised using the njobs argument.
 You can also specify the list of genes and the list of TFs to use.

```{r grn, eval=FALSE}
grn <- define_grn(
  hummus,
  multilayer_f = "chen_multilayer",
  njobs = 5
  )
```
```{r head_grn, eval=FALSE}
grn
```

### 3.4. Retrieve enhancers
<br> You can also specify the list of peaks to use.

```{r enhancers}
enhancers <- define_enhancers(
  hummus,
  gene_list = list("ATF2"),
  multilayer_f = "chen_multilayer",
  njobs = 1
  )
```
```{r head_enhancers}
head(enhancers)
```

### 3.5. Retrieve binding regions
For binding regions output, you can specify the list of TFs (`tf_list`) to use as seed
(if NULL by default, all TFs will be used as seed). Only the links between
the seed TFs and the peaks will be computed. You can also specify the list of
peaks to use. Only the score of the peaks present in the network and the
`peak_list` will be returned.
```{r}
binding_regions <- define_binding_regions(
  hummus,
  multilayer_f = "chen_multilayer",
  njobs = 1
  )
```
```{r}
head(binding_regions)
```