--- a
+++ b/docs/source/vignettes/TF enrichment.ipynb
@@ -0,0 +1,545 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Motif and Transcription Factor enrichment after Mowgli integration"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook sums up guidelines to: \n",
+    "\n",
+    "* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n",
+    "* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n",
+    "* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n",
+    "\n",
+    "**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
+    "\n",
+    "**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n",
+    "\n",
+    "**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific  "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extract top features of a modality from Mowgli"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We assume that the results of Mowgli integration are in a `mdata` object that store genes in the `rna` and peaks in the `atac` slot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Python\n",
+    "# method to extract the top features\n",
+    "def top_mowgli(dim, n, H_mowgli):\n",
+    "    \"\"\"\n",
+    "    Get the top n peaks for a given dimension.\n",
+    "    \"\"\"\n",
+    "    H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)\n",
+    "    return H_scaled[:, dim].argsort()[::-1][:n]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Extract the top peaks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Python\n",
+    "# Peaks\n",
+    "\n",
+    "# we set the number of peaks to look at\n",
+    "n_peaks = 100\n",
+    "\n",
+    "# Get the genes or peak dictionaries\n",
+    "H_mowgli_atac = mdata[\"atac\"].uns[\"H_OT\"]\n",
+    "\n",
+    "# actual features extraction\n",
+    "mdata[\"atac\"].var_names = mdata[\"atac\"].var_names.str.replace(\"atac:\", \"\")\n",
+    "top_in_mowgli = mdata[\"atac\"].var.copy()\n",
+    "\n",
+    "# Fill the Mowgli top peaks.\n",
+    "for dim in range(H_mowgli_atac.shape[1]):\n",
+    "    col_name = f\"top_in_dim_{dim}\"\n",
+    "    idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]\n",
+    "    top_in_mowgli[col_name] = False\n",
+    "    top_in_mowgli.loc[idx, col_name] = True\n",
+    "\n",
+    "# Save Mowgli's top peaks.\n",
+    "top_in_mowgli.to_csv(\"top_peaks_in_mowgli.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Extract the top genes (for other expression-space based enrchments)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Python\n",
+    "# we set the number of peaks to look at\n",
+    "n_genes = 100\n",
+    "\n",
+    "H_mowgli_rna = mdata[\"rna\"].uns[\"H_OT\"]\n",
+    "\n",
+    "# select the top genes to probe using only the highly variable genes (our universe)\n",
+    "top_in_mowgli = (\n",
+    "    mdata[\"rna\"].var.loc[mdata[\"rna\"].var[\"highly_variable\"] == True, :].copy()\n",
+    ")  # the var coordinates\n",
+    "\n",
+    "for dim in range(H_mowgli_rna.shape[1]):\n",
+    "    print(dim)\n",
+    "    # name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion\n",
+    "    col_name = f\"top_in_dim_{dim}\"\n",
+    "    idx = top_in_mowgli.index[\n",
+    "        top_mowgli(dim, n_genes, H_mowgli_rna)\n",
+    "    ]  # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds\n",
+    "    top_in_mowgli[col_name] = False  # set all value for that dimesions to False\n",
+    "    top_in_mowgli.loc[idx, col_name] = True  # set to True only the peaks that are\n",
+    "\n",
+    "# Save Mowgli's top genes.\n",
+    "top_in_mowgli.to_csv(\n",
+    "    os.path.join(top_feats_dir, f\"top_genes_in mowgli.csv\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Motif enrichment \n",
+    "\n",
+    "This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C).  This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Imports.\n",
+    "library(GenomicRanges)\n",
+    "library(motifmatchr)\n",
+    "library(chromVAR)\n",
+    "library(TFBSTools)\n",
+    "library(JASPAR2022)\n",
+    "library(Signac)\n",
+    "library(BSgenome.Hsapiens.UCSC.hg38)\n",
+    "library(chromVARmotifs)\n",
+    "library(MuData)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Read atac file.\n",
+    "in_atac <- \"top_peaks_in_mowgli.csv\" # nolint\n",
+    "peaks_csv <- read.csv(in_atac, row.names = 2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Optional: Remove non-canonical chromosomes.\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000194.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270721.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
+    "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270713.1\",]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Convert the peaks to GRanges.\n",
+    "chromosomes <- peaks_csv[\"Chromosome\"][, 1]\n",
+    "ranges <- IRanges::IRanges(\n",
+    "    start = peaks_csv[\"Start\"][, 1],\n",
+    "    end = peaks_csv[\"End\"][, 1]\n",
+    ")\n",
+    "peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Get JASPAR motifs.\n",
+    "opts <- list()\n",
+    "opts[\"species\"] <- \"Homo sapiens\"\n",
+    "opts[\"collection\"] <- \"CORE\"\n",
+    "motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)\n",
+    "motifs_pwm <- TFBSTools::toPWM(motifs)\n",
+    "\n",
+    "# Get cisBP motifs.\n",
+    "data(\"human_pwms_v2\")\n",
+    "\n",
+    "# Fuse JASPAR and cisBP motifs.\n",
+    "for (name in names(motifs_pwm)) {\n",
+    "    human_pwms_v2[name] <- motifs_pwm[name]\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Create a 'dummy' Signac object from the peaks.\n",
+    "# Actually giving peaks_csv is nonsense.\n",
+    "# But we only care about the rownames so it's fine.\n",
+    "assay <- Signac::CreateChromatinAssay(\n",
+    "    peaks_csv,\n",
+    "    ranges = peaks,\n",
+    "    sep = c(\":\", \"-\")\n",
+    ")\n",
+    "\n",
+    "# Create statistics about peaks.\n",
+    "assay <- Signac::RegionStats(\n",
+    "    object = assay,\n",
+    "    genome = BSgenome.Hsapiens.UCSC.hg38\n",
+    ")\n",
+    "\n",
+    "# Add the downloaded motif PWM annotation.\n",
+    "assay <- Signac::AddMotifs(\n",
+    "    object = assay,\n",
+    "    genome = BSgenome.Hsapiens.UCSC.hg38,\n",
+    "    pfm = human_pwms_v2\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "# Define where to save the motif enrichment outputs.\n",
+    "out_motif <- \"motifs_\"\n",
+    "\n",
+    "# Get all top peaks.\n",
+    "background <- c()\n",
+    "for (dim in 0:49) {\n",
+    "\n",
+    "    # Get the top peaks for that dimension.\n",
+    "    features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
+    "\n",
+    "    background <- c(background, features)\n",
+    "}\n",
+    "\n",
+    "# Iterate over Mowgli's dimensions.\n",
+    "for (dim in 0:49) {\n",
+    "\n",
+    "    # Get the top peaks for that dimension.\n",
+    "    features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
+    "\n",
+    "    # Do motif enrichment analysis.\n",
+    "    enriched_motifs <- Signac::FindMotifs(\n",
+    "        object = assay,\n",
+    "        features = features,\n",
+    "        background = background\n",
+    "    )\n",
+    "\n",
+    "    # Save the enrichment.\n",
+    "    write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "## TF Enrichment using top genes in mowgli dimensions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n",
+    "\n",
+    "In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Reload libraries\n",
+    "\n",
+    "library(stats)\n",
+    "library(dplyr)\n",
+    "library(stringr)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R \n",
+    "\n",
+    "# reload GRN and make a TF-> Target list\n",
+    "\n",
+    "# network of epithelial cells\n",
+    "grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n",
+    "grn <- read.table(grn.path, sep=\"\\t\", header = F)\n",
+    "colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n",
+    "\n",
+    "# make a TF -> Target list\n",
+    "tf_list <- split(grn$Target, grn$TF)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "\n",
+    "# reload top features for RNA and reparse it\n",
+    "top_feats.path <- \"top_genes_in_mowgli.csv\"\n",
+    "top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n",
+    "# set row names to index\n",
+    "row.names(top_feats) <- top_feats$hgnc_symbol\n",
+    "\n",
+    "cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n",
+    "\n",
+    "top_feats.filtered <- top_feats %>%\n",
+    "  select(all_of(cols_to_keep))\n",
+    "\n",
+    "top_feats.filtered <- top_feats.filtered %>%\n",
+    "  mutate(\n",
+    "    `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n",
+    "    across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n",
+    "  )\n",
+    "\n",
+    "# define the universe -> in this case, a list of highly variable genes in the RNA slot\n",
+    "universe <- readLines(\"highly_variable_genes.txt\")\n",
+    "\n",
+    "universe.len <- length(universe)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "In brief:\n",
+    "- we loop through each dimension and we select for each dimension the top features\n",
+    "- we loop through each TF (using a groupby) and we identify the top sets of features\n",
+    "- we make a hypergeometric test \n",
+    "- we calculate an enrichment score enrichment\n",
+    "- we correct the pvalue using Benjamini-hochberg correction\n",
+    "- we write the results to a file (only for significant TFs enriched)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "\n",
+    "# directory that will store the results\n",
+    "res.dir <- \"tf_enrichment\"\n",
+    "\n",
+    "# colnames of the dataframe storing the results\n",
+    "res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n",
+    "\n",
+    "# loop through all the top mowgli dimensions\n",
+    "for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n",
+    "    dim_number <- str_extract(dim, \"\\\\d+$\")\n",
+    "    print(paste(\"enriching:\", dim_number))\n",
+    "    \n",
+    "    # select the top features for that dimensiob\n",
+    "    top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n",
+    "    top_genes.len <- length(top_genes) # should always be 100\n",
+    "    ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n",
+    "\n",
+    "    # open the output file\n",
+    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
+    "    res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n",
+    "    colnames(res.df) <- res.colnames\n",
+    "   \n",
+    "    # define the list to store files \n",
+    "    p_values <- numeric()\n",
+    "    tfs <- character()\n",
+    "    n_targets <- numeric()\n",
+    "    enrichment_scores <- numeric()\n",
+    "\n",
+    "    # loop through all the TFs and perform the enrichment\n",
+    "    for (tf in names(tf_list)){\n",
+    "        targets <- tf_list[[tf]]\n",
+    "        \n",
+    "        x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n",
+    "        m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n",
+    "        n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n",
+    "        k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n",
+    "        \n",
+    "        p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n",
+    "        n_targets_expressed <- length(intersect(targets, rownames(universe)))\n",
+    "        n_targets_feats <- x/universe.len\n",
+    "        ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n",
+    "        enrichment_score <- 1/ (ratioTargets / ratioDim)\n",
+    "        \n",
+    "        # Store results for FDR adjustment\n",
+    "        p_values <- c(p_values, p_value)\n",
+    "        tfs <- c(tfs, tf)\n",
+    "        n_targets <- c(n_targets, x)\n",
+    "        enrichment_scores <- c(enrichment_scores, enrichment_score)\n",
+    "        }\n",
+    "\n",
+    "    # Adjust p-values using Benjamini-Hochberg correction\n",
+    "    adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n",
+    "\n",
+    "    # Combine results into a dataframe\n",
+    "    results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n",
+    "    \n",
+    "    # Filter results for significant adjusted p-values\n",
+    "    significant_results <- results[results$adjusted_p_value <= 0.05, ]\n",
+    "    \n",
+    "    # Save significant results to file\n",
+    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
+    "    write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n",
+    "}"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.8.5 ('base')",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "nbsphinx": {
+   "execute": "never"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}