546 lines (545 with data), 17.8 kB
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Motif and Transcription Factor enrichment after Mowgli integration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook sums up guidelines to: \n",
"\n",
"* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n",
"* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n",
"* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n",
"\n",
"**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
"\n",
"**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n",
"\n",
"**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract top features of a modality from Mowgli"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We assume that the results of Mowgli integration are in a `mdata` object that store genes in the `rna` and peaks in the `atac` slot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Python\n",
"# method to extract the top features\n",
"def top_mowgli(dim, n, H_mowgli):\n",
" \"\"\"\n",
" Get the top n peaks for a given dimension.\n",
" \"\"\"\n",
" H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)\n",
" return H_scaled[:, dim].argsort()[::-1][:n]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract the top peaks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Python\n",
"# Peaks\n",
"\n",
"# we set the number of peaks to look at\n",
"n_peaks = 100\n",
"\n",
"# Get the genes or peak dictionaries\n",
"H_mowgli_atac = mdata[\"atac\"].uns[\"H_OT\"]\n",
"\n",
"# actual features extraction\n",
"mdata[\"atac\"].var_names = mdata[\"atac\"].var_names.str.replace(\"atac:\", \"\")\n",
"top_in_mowgli = mdata[\"atac\"].var.copy()\n",
"\n",
"# Fill the Mowgli top peaks.\n",
"for dim in range(H_mowgli_atac.shape[1]):\n",
" col_name = f\"top_in_dim_{dim}\"\n",
" idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]\n",
" top_in_mowgli[col_name] = False\n",
" top_in_mowgli.loc[idx, col_name] = True\n",
"\n",
"# Save Mowgli's top peaks.\n",
"top_in_mowgli.to_csv(\"top_peaks_in_mowgli.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract the top genes (for other expression-space based enrchments)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Python\n",
"# we set the number of peaks to look at\n",
"n_genes = 100\n",
"\n",
"H_mowgli_rna = mdata[\"rna\"].uns[\"H_OT\"]\n",
"\n",
"# select the top genes to probe using only the highly variable genes (our universe)\n",
"top_in_mowgli = (\n",
" mdata[\"rna\"].var.loc[mdata[\"rna\"].var[\"highly_variable\"] == True, :].copy()\n",
") # the var coordinates\n",
"\n",
"for dim in range(H_mowgli_rna.shape[1]):\n",
" print(dim)\n",
" # name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion\n",
" col_name = f\"top_in_dim_{dim}\"\n",
" idx = top_in_mowgli.index[\n",
" top_mowgli(dim, n_genes, H_mowgli_rna)\n",
" ] # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds\n",
" top_in_mowgli[col_name] = False # set all value for that dimesions to False\n",
" top_in_mowgli.loc[idx, col_name] = True # set to True only the peaks that are\n",
"\n",
"# Save Mowgli's top genes.\n",
"top_in_mowgli.to_csv(\n",
" os.path.join(top_feats_dir, f\"top_genes_in mowgli.csv\"),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Motif enrichment \n",
"\n",
"This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Imports.\n",
"library(GenomicRanges)\n",
"library(motifmatchr)\n",
"library(chromVAR)\n",
"library(TFBSTools)\n",
"library(JASPAR2022)\n",
"library(Signac)\n",
"library(BSgenome.Hsapiens.UCSC.hg38)\n",
"library(chromVARmotifs)\n",
"library(MuData)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Read atac file.\n",
"in_atac <- \"top_peaks_in_mowgli.csv\" # nolint\n",
"peaks_csv <- read.csv(in_atac, row.names = 2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Optional: Remove non-canonical chromosomes.\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000194.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270721.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n",
"peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270713.1\",]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Convert the peaks to GRanges.\n",
"chromosomes <- peaks_csv[\"Chromosome\"][, 1]\n",
"ranges <- IRanges::IRanges(\n",
" start = peaks_csv[\"Start\"][, 1],\n",
" end = peaks_csv[\"End\"][, 1]\n",
")\n",
"peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Get JASPAR motifs.\n",
"opts <- list()\n",
"opts[\"species\"] <- \"Homo sapiens\"\n",
"opts[\"collection\"] <- \"CORE\"\n",
"motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)\n",
"motifs_pwm <- TFBSTools::toPWM(motifs)\n",
"\n",
"# Get cisBP motifs.\n",
"data(\"human_pwms_v2\")\n",
"\n",
"# Fuse JASPAR and cisBP motifs.\n",
"for (name in names(motifs_pwm)) {\n",
" human_pwms_v2[name] <- motifs_pwm[name]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Create a 'dummy' Signac object from the peaks.\n",
"# Actually giving peaks_csv is nonsense.\n",
"# But we only care about the rownames so it's fine.\n",
"assay <- Signac::CreateChromatinAssay(\n",
" peaks_csv,\n",
" ranges = peaks,\n",
" sep = c(\":\", \"-\")\n",
")\n",
"\n",
"# Create statistics about peaks.\n",
"assay <- Signac::RegionStats(\n",
" object = assay,\n",
" genome = BSgenome.Hsapiens.UCSC.hg38\n",
")\n",
"\n",
"# Add the downloaded motif PWM annotation.\n",
"assay <- Signac::AddMotifs(\n",
" object = assay,\n",
" genome = BSgenome.Hsapiens.UCSC.hg38,\n",
" pfm = human_pwms_v2\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"# Define where to save the motif enrichment outputs.\n",
"out_motif <- \"motifs_\"\n",
"\n",
"# Get all top peaks.\n",
"background <- c()\n",
"for (dim in 0:49) {\n",
"\n",
" # Get the top peaks for that dimension.\n",
" features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
"\n",
" background <- c(background, features)\n",
"}\n",
"\n",
"# Iterate over Mowgli's dimensions.\n",
"for (dim in 0:49) {\n",
"\n",
" # Get the top peaks for that dimension.\n",
" features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n",
"\n",
" # Do motif enrichment analysis.\n",
" enriched_motifs <- Signac::FindMotifs(\n",
" object = assay,\n",
" features = features,\n",
" background = background\n",
" )\n",
"\n",
" # Save the enrichment.\n",
" write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"## TF Enrichment using top genes in mowgli dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n",
"\n",
"In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Reload libraries\n",
"\n",
"library(stats)\n",
"library(dplyr)\n",
"library(stringr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R \n",
"\n",
"# reload GRN and make a TF-> Target list\n",
"\n",
"# network of epithelial cells\n",
"grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n",
"grn <- read.table(grn.path, sep=\"\\t\", header = F)\n",
"colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n",
"\n",
"# make a TF -> Target list\n",
"tf_list <- split(grn$Target, grn$TF)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"\n",
"# reload top features for RNA and reparse it\n",
"top_feats.path <- \"top_genes_in_mowgli.csv\"\n",
"top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n",
"# set row names to index\n",
"row.names(top_feats) <- top_feats$hgnc_symbol\n",
"\n",
"cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n",
"\n",
"top_feats.filtered <- top_feats %>%\n",
" select(all_of(cols_to_keep))\n",
"\n",
"top_feats.filtered <- top_feats.filtered %>%\n",
" mutate(\n",
" `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n",
" across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n",
" )\n",
"\n",
"# define the universe -> in this case, a list of highly variable genes in the RNA slot\n",
"universe <- readLines(\"highly_variable_genes.txt\")\n",
"\n",
"universe.len <- length(universe)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"In brief:\n",
"- we loop through each dimension and we select for each dimension the top features\n",
"- we loop through each TF (using a groupby) and we identify the top sets of features\n",
"- we make a hypergeometric test \n",
"- we calculate an enrichment score enrichment\n",
"- we correct the pvalue using Benjamini-hochberg correction\n",
"- we write the results to a file (only for significant TFs enriched)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"\n",
"# directory that will store the results\n",
"res.dir <- \"tf_enrichment\"\n",
"\n",
"# colnames of the dataframe storing the results\n",
"res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n",
"\n",
"# loop through all the top mowgli dimensions\n",
"for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n",
" dim_number <- str_extract(dim, \"\\\\d+$\")\n",
" print(paste(\"enriching:\", dim_number))\n",
" \n",
" # select the top features for that dimensiob\n",
" top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n",
" top_genes.len <- length(top_genes) # should always be 100\n",
" ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n",
"\n",
" # open the output file\n",
" output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
" res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n",
" colnames(res.df) <- res.colnames\n",
" \n",
" # define the list to store files \n",
" p_values <- numeric()\n",
" tfs <- character()\n",
" n_targets <- numeric()\n",
" enrichment_scores <- numeric()\n",
"\n",
" # loop through all the TFs and perform the enrichment\n",
" for (tf in names(tf_list)){\n",
" targets <- tf_list[[tf]]\n",
" \n",
" x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n",
" m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n",
" n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n",
" k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n",
" \n",
" p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n",
" n_targets_expressed <- length(intersect(targets, rownames(universe)))\n",
" n_targets_feats <- x/universe.len\n",
" ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n",
" enrichment_score <- 1/ (ratioTargets / ratioDim)\n",
" \n",
" # Store results for FDR adjustment\n",
" p_values <- c(p_values, p_value)\n",
" tfs <- c(tfs, tf)\n",
" n_targets <- c(n_targets, x)\n",
" enrichment_scores <- c(enrichment_scores, enrichment_score)\n",
" }\n",
"\n",
" # Adjust p-values using Benjamini-Hochberg correction\n",
" adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n",
"\n",
" # Combine results into a dataframe\n",
" results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n",
" \n",
" # Filter results for significant adjusted p-values\n",
" significant_results <- results[results$adjusted_p_value <= 0.05, ]\n",
" \n",
" # Save significant results to file\n",
" output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
" write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.5 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"nbsphinx": {
"execute": "never"
}
},
"nbformat": 4,
"nbformat_minor": 2
}