--- a
+++ b/vignettes/chen_vignette.Rmd
@@ -0,0 +1,360 @@
+---
+title: "Predict GRN from scRNA+scATAC data (Chen 2018 dataset)"
+author: "Trimbour Rémi"
+date: "2023-05-16"
+output:
+  html_document: default
+  github_document: default
+  pdf_document: default
+resource_files:
+  - figures/schema_HuMMuS.png
+  - figures/hummus_object_description.png
+  - figures/build_multilayer.png
+  - figures/5_steps.png
+  - figures/explore_multilayer.png
+---
+
+
+## Useful links:
+*Paper*: https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1
+
+*Github repo*: https://github.com/cantinilab/HuMMuS
+
+*Documentation*: https://cantinilab.github.io/HuMMuS/
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(eval = TRUE)
+devtools::install_github("cantinilab/HuMMuS", ref="dev_SeuratV5")
+```
+# General description of the pipeline
+![Overall pipeline](figures/schema_HuMMuS.png)
+
+## Useful ressources
+*Preprint detailing the method:* [https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1](https://www.biorxiv.org/content/10.1101/2023.06.09.543828v1)
+
+*Github repo detailing the installation:* [https://github.com/cantinilab/HuMMuS](https://github.com/cantinilab/HuMMuS/tree/)
+
+*Documentation and vignette:* [https://cantinilab.github.io/HuMMuS](https://cantinilab.github.io/HuMMuS)
+
+
+### General plan
+##### 0. Preparation of the environment
+##### 1. Initialisation of a hummus object 
+##### 2. Construction of the multilayer
+##### 3. Analyse multilayer and define gene regulatory network (GRN)
+    
+
+
+## 0. Setting up the environment
+```{r import_packages}
+# install python dependency
+envname = "r-reticulate"
+# reticulate::py_install("hummuspy", envname = envname, pip=TRUE)
+reticulate::use_condaenv(envname)
+
+hummuspy <- reticulate::import("hummuspy")
+library(HuMMuS)
+```
+## Download the single-cell data
+The data used in this tutorial can be [downloaded here](https://figshare.com/account/home#/projects/168899)
+
+
+## 1. Initialisation of HuMMuS object
+HuMMuS R objects are instances developed on top of seurat objects. It means it’s created from a seurat object and the contained assays can be accessed the same way.
+
+Additionally, it contains a motifs_db object, providing tf motifs informations, and a multilayer objects, that will be completed while going through this tutorial. It will mostly include :
+
+  - list of multiplex networks (one per modality) 
+  - list of bipartites (one per connection between layers)
+![hummus object schema](figures/hummus_object_description.png)
+
+
+### 1.1. Transform data into a hummus object
+```{r initiate_hummus_object}
+# Load the Chen dataset, which is a Seurat object containing scRNA-seq and scATAC-seq data
+data("chen_dataset_subset")
+chen_dataset_subset
+
+# Create an hummus object from seurat object
+hummus <- Initiate_Hummus_Object(chen_dataset_subset)
+```
+  
+### 1.2. Add genome and motif annotations to hummus object
+Fetch genome annotations online (necessitate an internet connection).
+You can also request any "EnsDB" object adapted to your data 
+(e.g. EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 for human genome annotations)
+ or use your own genome annotations in the same format.
+
+```{r genome_annotations, eval=TRUE, warning=FALSE}
+# get human genome annotation from EndDb data
+# wrapper of Signac::GetGRangesFromEnsDb, adapting output to UCSC format
+genome_annotations <- get_genome_annotations(
+  ensdb_annotations = EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86)
+
+# can also be downloaded, saved as an RDS objects for example
+```
+
+Add genome annotations to hummus/seurat object
+```{r add_genome_annotations, eval=TRUE}
+Signac::Annotation(hummus@assays$peaks) <- genome_annotations
+rm(genome_annotations)
+```
+
+Get TF motifs from JASPAR2020 and chromVARmotifs databsases in a motifs_db
+object. By default, human motifs are used. You can specify the species you want
+to use with the `species` argument (e.g. species = "mouse" for mouse).
+motifs_db objects contain 3 slots : 
+* `motifs = "PWMatrixList"`
+* `tf2motifs = "data.frame"`
+* `tfs = "NULL"`
+PWMatrixList is a named vector of the motif matrices, whil tf2motifs is a
+correspondance table between TFs and motifs. tfs is a named vector of the TFs.
+You can also use your own motifs_db object, as long as it contains the same
+slots.
+
+```{r get_tf2motifs, eval=TRUE}
+# Load TF motifs from JASPAR2020 and chromVARmotifs in hummus object
+hummus@motifs_db <- get_tf2motifs() # by default human motifs
+```
+
+## 2. Construction of the multilayer
+
+![hummus object schema](figures/build_multilayer.png)
+
+You can compute the different layers and bipartites as indicated below.
+
+An example multilayer completed can also be imported with : `data(chen_subset_hummus)`.
+This object corresponds to a multilayer from chen_dataset_subset completed. You can then go to the part 3, replacing `hummus` by `chen_subset_hummus` in each step.
+
+Finally, [you can compute the different layers before, and add them afterwards](add_networks.Rmd).
+It allows to use faster methods to compute the networks
+(e.g. [Arboreto](https://arboreto.readthedocs.io/en/latest/) for the gene network,
+ [ATACNet](https://github.com/r-trimbour/ATACNet) for the peak network, etc.).
+
+### Compute 3 layers and 2 bipartites
+![hummus object schema](figures/5_steps.png)
+
+**!! Long step !!** You can also go directly to the part 3 for your "discovery tour". :)
+
+### 2.1. TF - peaks bipartite reconstruction
+
+TF - peaks bipartite is computed using the motifs_db object and the peak
+assay. You can specify the assay to use to filter TFs (e.g. "RNA" if you want
+to use only the TFs expressed in your dataset). If NULL, all TFs with motifs
+will be used.
+BSGenome object is used to identify location of motifs and intersect them with
+peak 
+<br>You can also specify the name of the bipartite that will be added to the
+ hummus object. By default, it will be named "tf_peak".
+```{r bipartite_tf_peak, eval=TRUE}
+hummus <- bipartite_tfs2peaks(
+              hummus_object = hummus,
+              tf_expr_assay = "RNA", # use to filter TF on only expressed TFs,
+                                     # if NULL, all TFs with motifs are used
+              peak_assay = "peaks",
+              tf_multiplex_name = "TF",
+              genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
+              )
+```
+
+### 2.2. Genes - peaks bipartite reconstruction
+Peaks - genes bipartite is computed 
+```{r bipartite_peaks2genes, eval=TRUE}
+hummus <- bipartite_peaks2genes(
+                      hummus_object = hummus,
+                      gene_assay = "RNA",
+                      peak_assay = "peaks",
+                      store_network = FALSE,
+                      )
+```
+
+#### Compute layer networks and add it to hummus object
+Each one of the three layers is computed individually. 
+
+### 2.3. Compute the TF network from OmniPath database
+We currently use OmniPath R package to fetch TF interactions. 
+You can first specify if you want to use only the TFs expressed in your dataset
+(if you have a RNA assay in your hummus object). If `gene_assay` is NULL, all
+TFs with motifs will be used.
+<br> You can then specify which interactions you want to keep through
+'source_target' argument ("AND" | "OR"). If "AND", only the interactions
+between 2 TFs that are both present in the dataset will be kept. If "OR", all
+interactions involving at least one TF present in the dataset will be kept.
+<br>Finally, you can specify the name of the multiplex and the name of the
+network that will be added to the hummus object.
+The added network will be undirected and unweighted since PPI and OmniPath
+database are not directional nor return any weight here.
+```{r tf_network, eval=TRUE}
+hummus <- compute_tf_network(hummus,
+                            gene_assay = "RNA", # default = None ;
+                                                # If a assay is provided,
+                                                # only the TFs that are present
+                                                # will be considered
+                            verbose = 1,
+                            #source_target = "OR",
+                            multiplex_name = "TF",
+                            tf_network_name = "TF_network")
+```
+
+### 2.4. Compute gene network from scRNA-seq w/ GENIE3
+### *!! This step can be very slow if you have thousands of cells !!*
+_Current recommendation if you have a big dataset is to compute the network_
+_before with [GRNBoost2 thorugh arboreto](https://arboreto.readthedocs.io/en/latest/)_
+_and [add it to the hummus object afterwards](https://cantinilab.github.io/HuMMuS/articles/add_networks.html)._
+Different methods can be used to compute the gene network. For now, only GENIE3
+is implemented in HuMMuS. You can specify which assay to use to compute the
+network (`gene_assay`).
+<br>You can specify the number of cores to use to compute
+the network. You can also specify if you want to save the network locally
+(`store_network = TRUE`) or not (`store_network = FALSE`). If you choose to save
+the network, you will need to specify the output file name (`output_file`).
+The returned network will be considered undirected and weighted. While GENIE3
+returns a directed network, we symmetrize it for the random walk with restart
+exploration of the genes proximity.
+
+```{r gene_network, eval=TRUE}
+hummus <- compute_gene_network(
+              hummus,
+              gene_assay = "RNA",
+              method = "GENIE3",
+              verbose = 1,
+              number_cores = 5, # GENIE3 method can be ran
+                                # parallelised on multiple cores
+              store_network = FALSE, # by default : FALSE, but
+                                     # each network can be saved 
+                                     # when computed with hummus
+              output_file = "gene_network.tsv")
+```
+
+### 2.5. Compute the peak network from scATAC-seq w/ Cicero
+Different methods can be used to compute the peak network. For now, only Cicero
+is implemented in HuMMuS. You can specify which assay to use to compute the network
+(`peak_assay`). You can also specify the number of cores to use to compute the
+network. You can also specify if you want to save the network locally
+(`store_network = TRUE`) or not (`store_network = FALSE`). If you choose to save
+the network, you will need to specify the output file name (`output_file`).
+The returned network will be considered undirected and weighted, since cis-regulatory
+interaction and Cicero outputs are not directional. 
+
+```{r peak_network, eval=TRUE}
+hummus <- compute_atac_peak_network(hummus,
+              atac_assay = "peaks",
+              verbose = 1,
+              genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
+              store_network = FALSE)  
+```
+
+
+## 3. Analyse of the multilayer and definition of GRN
+```{r load_precomputed}
+data(chen_subset_hummus) 
+hummus <- chen_subset_hummus
+```
+
+### 3.1. Save the mulilayer in a classic hierarchical structure
+The package used for the random walk with restart exploration (multixrank)
+ requires currently to save all the network files on disk. To simplify the
+ organisation of the file, it is possible to save everything necessary with
+ function `save_multilayer()`.<br>
+It will create a folder (specified through `folder_name`) containing all the
+ files necessary to run the multixrank algorithm. The folder will contain
+ the following subfolders :
+  * **bipartite** : containing the bipartites files 
+  * **multiplex** : containing the multiplex sub-subfolders
+    * **multiplex_1** (e.g. TF|peak|RNA) : containing the network file
+    of each layer of the multiplex
+  * **seed** : that will contain the seed files (necessary to compute HuMMuS
+    outputs later)
+  * **config** : that will contain the config files (necessary to compute
+    HuMMuS outputs later)
+    
+![hummus object schema](figures/explore_multilayer.png)
+```{r save_multilayer}
+save_multilayer(hummus = hummus,
+                folder_name = "chen_multilayer")
+```
+
+### 3.2. Retrieve target genes
+With HuMMuS, inference of GRN and target gene of TFs are different outputs.
+Indeed, while GRN is computed making TFs compete to regulate genes (by random
+walk with restart starting from the genes and going to the TFs), target genes
+are computed making genes compete to be regulated by TFs (by random walk with
+restart starting from the TFs and going to the genes).<br>
+For target genes output, you can specify the list of TFs (`tf_list`) to use as seed
+(if NULL by default, all TFs will be used as seed). Only the links between
+the seed TFs and the genes will be computed. You can also specify the list of
+genes to use. Only the score of the genes present in the network and the
+`gene_list` will be returned.
+
+```{r target_genes_ATF2, eval=TRUE}
+ATF2_genes <- define_target_genes(
+  hummus,
+  tf_list = list("ATF2"),
+  multilayer_f = "chen_multilayer",
+  njobs = 1
+  )
+```
+```{r head_target_genes_ATF2}
+head(ATF2_genes)
+```
+
+
+```{r target_genes}
+target_genes <- define_target_genes(
+  hummus,
+  multilayer_f = "chen_multilayer",
+  njobs = 1
+  )
+```
+```{r head_target_genes}
+head(target_genes)
+```
+
+### 3.3. Define GRN
+The GRN is defined using the multixrank algorithm. It requires to have
+ the hummuspy python package installed (pip install hummuspy).<br>
+ <br> This can be parallelised using the njobs argument.
+ You can also specify the list of genes and the list of TFs to use.
+
+```{r grn, eval=FALSE}
+grn <- define_grn(
+  hummus,
+  multilayer_f = "chen_multilayer",
+  njobs = 5
+  )
+```
+```{r head_grn, eval=FALSE}
+grn
+```
+
+### 3.4. Retrieve enhancers
+<br> You can also specify the list of peaks to use.
+
+```{r enhancers}
+enhancers <- define_enhancers(
+  hummus,
+  gene_list = list("ATF2"),
+  multilayer_f = "chen_multilayer",
+  njobs = 1
+  )
+```
+```{r head_enhancers}
+head(enhancers)
+```
+
+### 3.5. Retrieve binding regions
+For binding regions output, you can specify the list of TFs (`tf_list`) to use as seed
+(if NULL by default, all TFs will be used as seed). Only the links between
+the seed TFs and the peaks will be computed. You can also specify the list of
+peaks to use. Only the score of the peaks present in the network and the
+`peak_list` will be returned.
+```{r}
+binding_regions <- define_binding_regions(
+  hummus,
+  multilayer_f = "chen_multilayer",
+  njobs = 1
+  )
+```
+```{r}
+head(binding_regions)
+```