--- a +++ b/docs/source/vignettes/TF enrichment.ipynb @@ -0,0 +1,545 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Motif and Transcription Factor enrichment after Mowgli integration" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook sums up guidelines to: \n", + "\n", + "* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n", + "* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n", + "* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n", + "\n", + "**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n", + "\n", + "**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n", + "\n", + "**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extract top features of a modality from Mowgli" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We assume that the results of Mowgli integration are in a `mdata` object that store genes in the `rna` and peaks in the `atac` slot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Python\n", + "# method to extract the top features\n", + "def top_mowgli(dim, n, H_mowgli):\n", + " \"\"\"\n", + " Get the top n peaks for a given dimension.\n", + " \"\"\"\n", + " H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)\n", + " return H_scaled[:, dim].argsort()[::-1][:n]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract the top peaks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Python\n", + "# Peaks\n", + "\n", + "# we set the number of peaks to look at\n", + "n_peaks = 100\n", + "\n", + "# Get the genes or peak dictionaries\n", + "H_mowgli_atac = mdata[\"atac\"].uns[\"H_OT\"]\n", + "\n", + "# actual features extraction\n", + "mdata[\"atac\"].var_names = mdata[\"atac\"].var_names.str.replace(\"atac:\", \"\")\n", + "top_in_mowgli = mdata[\"atac\"].var.copy()\n", + "\n", + "# Fill the Mowgli top peaks.\n", + "for dim in range(H_mowgli_atac.shape[1]):\n", + " col_name = f\"top_in_dim_{dim}\"\n", + " idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]\n", + " top_in_mowgli[col_name] = False\n", + " top_in_mowgli.loc[idx, col_name] = True\n", + "\n", + "# Save Mowgli's top peaks.\n", + "top_in_mowgli.to_csv(\"top_peaks_in_mowgli.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Extract the top genes (for other expression-space based enrchments)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Python\n", + "# we set the number of peaks to look at\n", + "n_genes = 100\n", + "\n", + "H_mowgli_rna = mdata[\"rna\"].uns[\"H_OT\"]\n", + "\n", + "# select the top genes to probe using only the highly variable genes (our universe)\n", + "top_in_mowgli = (\n", + " mdata[\"rna\"].var.loc[mdata[\"rna\"].var[\"highly_variable\"] == True, :].copy()\n", + ") # the var coordinates\n", + "\n", + "for dim in range(H_mowgli_rna.shape[1]):\n", + " print(dim)\n", + " # name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion\n", + " col_name = f\"top_in_dim_{dim}\"\n", + " idx = top_in_mowgli.index[\n", + " top_mowgli(dim, n_genes, H_mowgli_rna)\n", + " ] # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds\n", + " top_in_mowgli[col_name] = False # set all value for that dimesions to False\n", + " top_in_mowgli.loc[idx, col_name] = True # set to True only the peaks that are\n", + "\n", + "# Save Mowgli's top genes.\n", + "top_in_mowgli.to_csv(\n", + " os.path.join(top_feats_dir, f\"top_genes_in mowgli.csv\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Motif enrichment \n", + "\n", + "This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Imports.\n", + "library(GenomicRanges)\n", + "library(motifmatchr)\n", + "library(chromVAR)\n", + "library(TFBSTools)\n", + "library(JASPAR2022)\n", + "library(Signac)\n", + "library(BSgenome.Hsapiens.UCSC.hg38)\n", + "library(chromVARmotifs)\n", + "library(MuData)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Read atac file.\n", + "in_atac <- \"top_peaks_in_mowgli.csv\" # nolint\n", + "peaks_csv <- read.csv(in_atac, row.names = 2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Optional: Remove non-canonical chromosomes.\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000194.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270721.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n", + "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270713.1\",]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Convert the peaks to GRanges.\n", + "chromosomes <- peaks_csv[\"Chromosome\"][, 1]\n", + "ranges <- IRanges::IRanges(\n", + " start = peaks_csv[\"Start\"][, 1],\n", + " end = peaks_csv[\"End\"][, 1]\n", + ")\n", + "peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Get JASPAR motifs.\n", + "opts <- list()\n", + "opts[\"species\"] <- \"Homo sapiens\"\n", + "opts[\"collection\"] <- \"CORE\"\n", + "motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)\n", + "motifs_pwm <- TFBSTools::toPWM(motifs)\n", + "\n", + "# Get cisBP motifs.\n", + "data(\"human_pwms_v2\")\n", + "\n", + "# Fuse JASPAR and cisBP motifs.\n", + "for (name in names(motifs_pwm)) {\n", + " human_pwms_v2[name] <- motifs_pwm[name]\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Create a 'dummy' Signac object from the peaks.\n", + "# Actually giving peaks_csv is nonsense.\n", + "# But we only care about the rownames so it's fine.\n", + "assay <- Signac::CreateChromatinAssay(\n", + " peaks_csv,\n", + " ranges = peaks,\n", + " sep = c(\":\", \"-\")\n", + ")\n", + "\n", + "# Create statistics about peaks.\n", + "assay <- Signac::RegionStats(\n", + " object = assay,\n", + " genome = BSgenome.Hsapiens.UCSC.hg38\n", + ")\n", + "\n", + "# Add the downloaded motif PWM annotation.\n", + "assay <- Signac::AddMotifs(\n", + " object = assay,\n", + " genome = BSgenome.Hsapiens.UCSC.hg38,\n", + " pfm = human_pwms_v2\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "# Define where to save the motif enrichment outputs.\n", + "out_motif <- \"motifs_\"\n", + "\n", + "# Get all top peaks.\n", + "background <- c()\n", + "for (dim in 0:49) {\n", + "\n", + " # Get the top peaks for that dimension.\n", + " features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n", + "\n", + " background <- c(background, features)\n", + "}\n", + "\n", + "# Iterate over Mowgli's dimensions.\n", + "for (dim in 0:49) {\n", + "\n", + " # Get the top peaks for that dimension.\n", + " features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n", + "\n", + " # Do motif enrichment analysis.\n", + " enriched_motifs <- Signac::FindMotifs(\n", + " object = assay,\n", + " features = features,\n", + " background = background\n", + " )\n", + "\n", + " # Save the enrichment.\n", + " write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "source": [ + "## TF Enrichment using top genes in mowgli dimensions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "source": [ + "We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n", + "\n", + "In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Reload libraries\n", + "\n", + "library(stats)\n", + "library(dplyr)\n", + "library(stringr)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R \n", + "\n", + "# reload GRN and make a TF-> Target list\n", + "\n", + "# network of epithelial cells\n", + "grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n", + "grn <- read.table(grn.path, sep=\"\\t\", header = F)\n", + "colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n", + "\n", + "# make a TF -> Target list\n", + "tf_list <- split(grn$Target, grn$TF)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "\n", + "# reload top features for RNA and reparse it\n", + "top_feats.path <- \"top_genes_in_mowgli.csv\"\n", + "top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n", + "# set row names to index\n", + "row.names(top_feats) <- top_feats$hgnc_symbol\n", + "\n", + "cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n", + "\n", + "top_feats.filtered <- top_feats %>%\n", + " select(all_of(cols_to_keep))\n", + "\n", + "top_feats.filtered <- top_feats.filtered %>%\n", + " mutate(\n", + " `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n", + " across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n", + " )\n", + "\n", + "# define the universe -> in this case, a list of highly variable genes in the RNA slot\n", + "universe <- readLines(\"highly_variable_genes.txt\")\n", + "\n", + "universe.len <- length(universe)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "source": [ + "In brief:\n", + "- we loop through each dimension and we select for each dimension the top features\n", + "- we loop through each TF (using a groupby) and we identify the top sets of features\n", + "- we make a hypergeometric test \n", + "- we calculate an enrichment score enrichment\n", + "- we correct the pvalue using Benjamini-hochberg correction\n", + "- we write the results to a file (only for significant TFs enriched)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# R\n", + "\n", + "# directory that will store the results\n", + "res.dir <- \"tf_enrichment\"\n", + "\n", + "# colnames of the dataframe storing the results\n", + "res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n", + "\n", + "# loop through all the top mowgli dimensions\n", + "for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n", + " dim_number <- str_extract(dim, \"\\\\d+$\")\n", + " print(paste(\"enriching:\", dim_number))\n", + " \n", + " # select the top features for that dimensiob\n", + " top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n", + " top_genes.len <- length(top_genes) # should always be 100\n", + " ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n", + "\n", + " # open the output file\n", + " output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n", + " res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n", + " colnames(res.df) <- res.colnames\n", + " \n", + " # define the list to store files \n", + " p_values <- numeric()\n", + " tfs <- character()\n", + " n_targets <- numeric()\n", + " enrichment_scores <- numeric()\n", + "\n", + " # loop through all the TFs and perform the enrichment\n", + " for (tf in names(tf_list)){\n", + " targets <- tf_list[[tf]]\n", + " \n", + " x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n", + " m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n", + " n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n", + " k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n", + " \n", + " p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n", + " n_targets_expressed <- length(intersect(targets, rownames(universe)))\n", + " n_targets_feats <- x/universe.len\n", + " ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n", + " enrichment_score <- 1/ (ratioTargets / ratioDim)\n", + " \n", + " # Store results for FDR adjustment\n", + " p_values <- c(p_values, p_value)\n", + " tfs <- c(tfs, tf)\n", + " n_targets <- c(n_targets, x)\n", + " enrichment_scores <- c(enrichment_scores, enrichment_score)\n", + " }\n", + "\n", + " # Adjust p-values using Benjamini-Hochberg correction\n", + " adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n", + "\n", + " # Combine results into a dataframe\n", + " results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n", + " \n", + " # Filter results for significant adjusted p-values\n", + " significant_results <- results[results$adjusted_p_value <= 0.05, ]\n", + " \n", + " # Save significant results to file\n", + " output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n", + " write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n", + "}" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.5 ('base')", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + }, + "nbsphinx": { + "execute": "never" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}