st-histology-ml / Git / Diff of /EDA.Rmd

Models:

DanielG/

st-histology-ml

Downloads: 1

Diff of /EDA.Rmd [000000] .. [c9fae8]

Switch to unified view

 b/EDA.Rmd
+---
+title: "R Notebook"
+output: html_notebook
+editor_options:
+  markdown:
+    wrap: 72
+---
+# Exploratory Data Analysis and Feature Extraction
+In this markdown, exploratory data analysis is performed on sample
+, with the goal of discovering genes whose expression values are
+correlated with simple histology image features (RGB/Grayscale color
+space). Sample 151507 includes 4136 spots and 21131 genes after
+performing QC on spots and removing genes with an expression level of 0
+for all spots.
+First, RGB/grayscale color channels are extracted from the image. All
+images are in lowres, meaning that they each have size 600 x 600. Then,
+the 4136 x 9 dataframe 'data' is created. Each row of data represents a
+barcode/spot, and the following column data is represented:
+"pxl_col_in_fullres": column number of barcode in fullres image
+"pxl_row_in_fullres": row number of barcode in fullres image
+"barcode": string containing barcode
+"scaled_pxl_col_in_lowres": column number of barcode in lowres image
+"scaled_pxl_row_in_lowres": row number of barcode in lowres image
+"red": red channel value of barcode
+"green": green channel value of barcode
+"blue": blue channel value of barcode
+"grey": grey channel value of barcode
+Then, correlation + the removal of outliers is performed in two ways,
+with the second being optimal. 1. Correlation between gene expression
+values and RGB/Grayscale pixel values is performed; then, outlier pixel
+and gene expression values are removed. 2. Outliers are removed first,
+then correlation is performed.
+## Imports
+```{r Import, message=FALSE}
+library(spatialLIBD)
+library(scuttle)
+library(dplyr)
+library(ggcorrplot)
+library(ggplot2)
+library(EBImage)
+library(ggpubr)
+library(foreach)
+library(doParallel)
+library(plotwidgets)
+library(Dict)
+library(lme4)
+registerDoParallel(4)
+library(reticulate)
+py_install("tensorflow")
+py_install("keras")
+layers <- import("tensorflow.keras.layers")
+models <- import("tensorflow.keras.models")
+optimizers <- import("tensorflow.keras.optimizers")
+```
+## Finding Features of Interest From Color Spaces
+The first features we want to extract from the sample are from the
+RGBGrayscale + HSL color spaces. We can first scale the pxl values in
+the SpatialExperiment object to fit the lowres image, and then determine
+the feature values for each pixel and add them to the SpatialExperiment
+object.
+```{r Helper}
+### Helper Functions for Finding Features of Interest
+get_img <- function(sample_id) {
+  png_link <- paste('https://spatial-dlpfc.s3.us-east-2.amazonaws.com/images/',sample_id,'_tissue_lowres_image.png',sep='')
+  img <- readImage(png_link)
+  img
+}
+get_rgbg <- function(img) {
+  # create 600x600 color channel matrices for each channel
+  red <- imageData(EBImage::channel(img, 'red'))
+  green <- imageData(EBImage::channel(img, 'green'))
+  blue <- imageData(EBImage::channel(img, 'blue'))
+  grey <- imageData(EBImage::channel(img, 'grey'))
+  list('red'=red, 'green'=green, 'blue'=blue, 'grey'=grey)
+}
+visualize_color_channels <- function(rgbg) {
+  hist(rgbg$red)
+  hist(rgbg$green)
+  hist(rgbg$blue)
+  hist(rgbg$grey)
+}
+get_hsl <- function(rgbg) {
+  rgb_mat <- t(data.frame(c(rgbg$red), c(rgbg$green), c(rgbg$blue)))
+  hsl_mat <- rgb2hsl(rgb_mat)
+  # HSL color space
+  hue <- matrix(hsl_mat[1,], nrow=600, ncol=600)
+  saturation <- matrix(hsl_mat[2,], nrow=600, ncol=600)
+  lightness <- matrix(hsl_mat[3,], nrow=600, ncol=600)
+  list("hue"=hue, "saturation"=saturation, "lightness"=lightness)
+}
+add_scaled_pxls_lowres <- function(spe_sub) {
+  scaling <- SpatialExperiment::scaleFactors(spe_sub)
+  pxl_col_in_lowres <-round(spatialCoords(spe_sub)[,'pxl_col_in_fullres'] * scaling)
+  pxl_row_in_lowres <-round(spatialCoords(spe_sub)[,'pxl_row_in_fullres'] * scaling)
+  spatialCoords(spe_sub) <- cbind(spatialCoords(spe_sub), pxl_col_in_lowres, pxl_row_in_lowres)
+  spe_sub
+}
+match_all_color_spaces <- function(color, pxl_indices) {
+  apply(pxl_indices, MARGIN=1, match_each_color_space, color=color)
+}
+match_each_color_space <- function(pxl_indices, color) {
+  color[pxl_indices[1], pxl_indices[2]]
+}
+add_color_features <- function(spe_sub, colors) {
+  pxl_indices <- spatialCoords(spe_sub)[,c('pxl_row_in_lowres','pxl_col_in_lowres')]
+  feature_df <- lapply(c(rgbg, hsl), match_all_color_spaces, pxl_indices=pxl_indices)
+  spatialCoords(spe_sub) <- cbind(spatialCoords(spe_sub),
+                                feature_df$red,
+                                feature_df$green,
+                                feature_df$blue,
+                                feature_df$grey,
+                                feature_df$hue,
+                                feature_df$saturation,
+                                feature_df$lightness)
+  colnames(spatialCoords(spe_sub)) <- c('pxl_col_in_fullres',
+                                        'pxl_row_in_fullres',
+                                        'pxl_col_in_lowres',
+                                        'pxl_row_in_lowres',
+                                        'red', 'green', 'blue', 'grey',
+                                        'hue', 'saturation', 'lightness')
+  spe_sub
+}
+```
+```{r}
+# Get image and RGBG/HSL feature spaces
+img <- get_img("151673")
+rgbg <- get_rgbg(img)
+hsl <- get_hsl(rgbg)
+# Add low resolution pixel values to SpatialExperiment object
+spe_sub <- add_scaled_pxls_lowres(spe_sub)
+# Match color space values to pixel row/column location, and add to SpatialExperiment object
+spe_sub <- add_color_features(spe_sub, c(rgbg, hsl))
+```
+## Finding Features of Interest From Autoencoder Latent Space
+Another approach to finding features of interest is by using an encoder
+model to capture the latent space between an encoder/decoder autoencoder
+model. Then, we can use these low-dimensional features to perform
+classification.
+To build this model, we first need to split out input image into image
+'patches' centered at each spot. The size of these patches was chosen to
+be 14x14 pixels. Then, for each spot, the corresponding image patch will
+be encoded into a feature vector.
+To simplify this process, a 'Spot' class has been created to store the
+current pixel values of the spot and features from the latent space.
+```{r}
+create_spot_instance <- function(rowname, img, spe_sub) {
+  coords = spatialCoords(spe_sub)[rowname,]
+  row = coords['pxl_row_in_lowres']
+  col = coords['pxl_col_in_lowres']
+  patch = imageData(img[(row - 6): (row + 6), (col - 6): (col + 6),])
+  new("Spot",
+      sampleId=imgData(spe_sub)$sample_id,
+      barcode=rowname,
+      imagePatch=patch,
+      spatialCoords=coords)
+}
+setClass("Spot",
+         slots=list(sampleId="character",
+                    barcode="character",
+                    imagePatch="array",
+                    spatialCoords="numeric",
+                    latentSpace="array"))
+vec <- lapply(rownames(spatialCoords(spe_sub)), create_spot_instance, img=img, spe_sub=spe_sub)
+```
+```{r}
+encoder_input = layers$Input(shape=as.integer(507), name='encoder_input')
+encoder_dense_layer1 = layers$Dense(units=as.integer(300), name='encoder_dense_1')(encoder_input)
+encoder_activ_layer1 = layers$LeakyReLU(name="encoder_leakyrelu_1")(encoder_dense_layer1)
+encoder_dense_layer2 = layers$Dense(units=as.integer(2), name='encoder_dense_2')(encoder_activ_layer1)
+encoder_output = layers$LeakyReLU(name='encoder_leakyrelu_2')(encoder_dense_layer2)
+encoder = models$Model(encoder_input, encoder_output, name="encoder_model")
+decoder_input = layers$Input(shape=as.integer(2), name='decoder_input')
+decoder_dense_layer1 = layers$Dense(units=as.integer(300), name='decoder_dense_1')(decoder_input)
+decoder_activ_layer1 = layers$LeakyReLU(name="decoder_leakyrelu_1")(decoder_dense_layer1)
+decoder_dense_layer2 = layers$Dense(units=as.integer(507), name='decoder_dense_2')(decoder_activ_layer1)
+decoder_output = layers$LeakyReLU(name='decoder_leakyrelu_2')(decoder_dense_layer2)
+decoder = models$Model(decoder_input, decoder_output, name="decoder_model")
+ae_input = layers$Input(shape=as.integer(507), name="AE_input")
+ae_encoder_output = encoder(ae_input)
+ae_decoder_output = decoder(ae_encoder_output)
+```
+```{r}
+get_encoded_features <- function(spot) {
+  temp <- c(spot@imagePatch)
+  dim(temp) <- c(1, length(temp))
+  ae = models$Model(ae_input, ae_decoder_output, name="AE")
+  ae$compile(loss="mse", optimizer=optimizers$Adam(learning_rate=0.0005))
+  ae$fit(temp, temp, epochs=as.integer(50), shuffle=TRUE, verbose=as.integer(0))
+  encoding = encoder$predict(temp, verbose=as.integer(0))
+  spot@latentSpace <- encoding
+  spot
+}
+extract_encodings <- function(spot) {
+  c(spot@latentSpace, spot@spatialCoords)
+}
+extract_barcodes <- function(spot) {
+  spot@barcode
+}
+add_gene_col <- function(g_name, features, logcounts, spe_sub) {
+  if (!(g_name %in% colnames(features))) {
+    features[,ncol(features) + 1] <- data.frame(logcounts[,get_gene_id(g_name, spe_sub)])
+    colnames(features)[ncol(features)] <- g_name
+  }
+  features
+}
+temp_vec_1 <- lapply(vec[1:500], get_encoded_features)
+temp_vec_2 <- lapply(vec[501:1000], get_encoded_features)
+temp_vec_3 <- lapply(vec[1001:1500], get_encoded_features)
+temp_vec_4 <- lapply(vec[1501:2000], get_encoded_features)
+temp_vec_5 <- lapply(vec[2001:2500], get_encoded_features)
+temp_vec_6 <- lapply(vec[2501:3000], get_encoded_features)
+temp_vec_7 <- lapply(vec[3001:3590], get_encoded_features)
+vec <- c(temp_vec_1, temp_vec_2, temp_vec_3, temp_vec_4, temp_vec_5, temp_vec_6, temp_vec_7)
+features <- t(sapply(vec, extract_encodings))
+rownames(features) <- sapply(vec, extract_barcodes)
+colnames(features)[1:2] <- c('latent_space_1', 'latent_space_2')
+features <- add_gene_col('MOBP', features, logcounts, spe_sub)
+features <- add_gene_col('SNAP25', features, logcounts, spe_sub)
+features <- add_gene_col('PCP4', features, logcounts, spe_sub)
+```
+```{r}
+# plot MOBP, SNAP25, PCP4 in 2D feature space
+print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'MOBP'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("MOBP Expression Level in 2D Latent Space")))
+print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'SNAP25'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("SNAP25 Expression Level in 2D Latent Space")))
+print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'PCP4'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("PCP4 Expression Level in 2D Latent Space")))
+```
+```{r}
+# Plot two feature latent space
+features <- data.frame(features)
+# Spot Plots for Autoencoder Features
+print(ggplot(features, aes(x = features[,'pxl_row_in_lowres'], y=600 - features[,'pxl_col_in_lowres'])) + geom_point(aes(color=features[,'latent_space_1'])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",'latent_space_1')))
+print(ggplot(features, aes(x = features[,'pxl_row_in_lowres'], y=600 - features[,'pxl_col_in_lowres'])) + geom_point(aes(color=features[,'latent_space_2'])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",'latent_space_2')))
+```
+```{r}
+mat <- compute_corr_matrix(logcounts, features[,1:13])
+top_corr <- get_top_k_corr(mat, 10)
+plot_corr_matrix(data.frame(top_corr))
+```
+## Finding Genes of Interest
+An overall goal of this exploratory data analysis is to identify genes
+of interest so that we can build predictive models in the future. To
+accomplish this, a correlation matrix between spots and image features
+(color spaces) is created. Then, the genes exhibiting the highest
+correlation to image features are selected and plotted.
+```{r}
+get_gene_name <- function(x, spe_sub) {
+  rowData(spe_sub)[rowData(spe_sub)$gene_id == x,]$gene_name
+}
+get_gene_id <- function(x, spe_sub) {
+  rowData(spe_sub)[rowData(spe_sub)$gene_name == x ,]$gene_id
+}
+create_id_name_dict <- function(logcounts, spe_sub) {
+  id_to_name <- mclapply(colnames(logcounts), get_gene_name, spe_sub=spe_sub)
+  names(id_to_name) <- colnames(logcounts)
+  id_to_name
+}
+# compute correlation matrix in parallel
+compute_corr_matrix <- function(expression_arr, color_arr, round=4) {
+  foreach(i = 1:ncol(expression_arr),
+  .combine = rbind,
+  .packages = c('data.table', 'doParallel')) %dopar% {
+    colName <- colnames(expression_arr)[i]
+    df <- data.frame(round(cor(expression_arr[,i], color_arr, method = 'pearson', use="complete.obs"), round))
+    rownames(df) <- colName
+    df
+  }
+}
+get_top_k_corr <- function(mat, k) {
+  temp_ind <- c()
+  for (i in 1:ncol(mat)) {
+    temp_ind <- c(temp_ind, order(abs(mat[,i]), decreasing=T)[1:k])
+  }
+  top_corr_ind <- unique(temp_ind)
+  # temp_ind <- c(temp_ind, order(abs(as.numeric(unlist(mat))), decreasing=T)[1:k])
+  # temp_ind <- temp_ind %% 600
+  top_corr <- rename_gene_ids(data.frame(mat[top_corr_ind,]))
+  top_corr
+}
+# Replace gene IDs with gene names
+rename_gene_ids <- function(correlation_matrix) {
+  for (i in 1:nrow(correlation_matrix)) {
+    g_name <- rownames(correlation_matrix)[i]
+    rownames(correlation_matrix)[i] <- rowData(spe_sub[g_name])$gene_name
+  }
+  correlation_matrix
+}
+# Plot correlation matrix
+plot_corr_matrix <- function(correlation_matrix) {
+  ggcorrplot(correlation_matrix, sig.level=0.01, lab_size = 4.5, p.mat = NULL,
+           insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1,
+           tl.cex = 7) +
+  theme(axis.text.x = element_text(margin=margin(-2,0,0,0)),
+        axis.text.y = element_text(margin=margin(0,-2,0,0)),
+        panel.grid.minor = element_line(size=7)) +
+  geom_tile(fill="white") +
+  geom_tile(height=1, width=1)
+}
+get_gene_id <- function(g_name, spe) {
+  gene_id = rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
+  gene_id
+}
+```
+```{r}
+gene_id <- get_gene_id('TERB1', spe_sub)
+mat[gene_id,]
+```
+```{r}
+logcounts <- t(logcounts(spe_sub))
+id_to_name <- create_id_name_dict(logcounts, spe_sub)
+mat <- compute_corr_matrix(logcounts,
+                           spatialCoords(spe_sub)[,5:ncol(spatialCoords(spe_sub))])
+top_corr <- get_top_k_corr(mat, 20)
+plot_corr_matrix(data.frame(top_corr))
+```
+```{r}
+mat[get_gene_id('SNAP25', spe_sub),]
+top_corr
+```
+## Spot Plots of Features
+```{r}
+temp_df <- data.frame(spatialCoords(spe_sub))
+for (g_name in colnames(temp_df[,5:11])) {
+  # print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
+  print(ggplot(temp_df, aes(x = temp_df[,'pxl_row_in_lowres'], y=600 - temp_df[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp_df[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
+}
+```
+```{r}
+# Add genes
+```
+```{r}
+temp_df <- data.frame(spatialCoords(spe_sub))
+for (g_name in colnames(temp_df[,5:11])) {
+  # print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
+  print(ggplot(temp_df, aes(x = temp_df[,'pxl_row_in_lowres'], y=600 - temp_df[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp_df[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
+}
+```
+### Create RGB and HSL color channels and plot histograms of distribution
+```{r, message=FALSE}
+png_link <- paste('https://spatial-dlpfc.s3.us-east-2.amazonaws.com/images/',sample_id,'_tissue_lowres_image.png',sep='')
+img <- readImage(png_link)
+# create 600x600 color channel matrices for each channel
+red <- imageData(EBImage::channel(img, 'red'))
+green <- imageData(EBImage::channel(img, 'green'))
+blue <- imageData(EBImage::channel(img, 'blue'))
+grey <- imageData(EBImage::channel(img, 'grey'))
+# visualize histograms of each color channel
+hist(red)
+hist(green)
+hist(blue)
+hist(grey)
+rgb_mat <- t(data.frame(c(red), c(green), c(blue)))
+hsl_mat <- rgb2hsl(rgb_mat)
+# HSL color space
+hue <- Matrix(hsl_mat[1,], nrow=600, ncol=600)
+saturation <- Matrix(hsl_mat[2,], nrow=600, ncol=600)
+lightness <- Matrix(hsl_mat[3,], nrow=600, ncol=600)
+# get scale factor
+scaling <- SpatialExperiment::scaleFactors(spe_sub)
+# store spatial coords information (pxl col and row numbers in fullres image for each barcode)
+spatial_coords <- spatialCoords(spe_sub)
+# create data frame to hold pixel locations and pixel color channel values
+data <- data.frame(spatial_coords, row.names=NULL)
+data['barcode'] <- rownames(spatial_coords)
+# add scaled pixel columns
+new <- ceiling(data[["pxl_col_in_fullres"]] * scaling)
+data[ , ncol(data) + 1] <- new
+colnames(data)[ncol(data)] <- "scaled_pxl_col_in_lowres"
+# add scaled pixel rows
+new <- ceiling(data[["pxl_row_in_fullres"]] * scaling)
+data[ , ncol(data) + 1] <- new
+colnames(data)[ncol(data)] <- "scaled_pxl_row_in_lowres"
+# create vectors to be added to dataframe
+X_r <- vector(mode="double", length = dim(data)[1])
+X_g <- vector(mode="double", length = dim(data)[1])
+X_b <- vector(mode="double", length = dim(data)[1])
+X_grey <- vector(mode="double", length = dim(data)[1])
+X_h <- vector(mode="double", length = dim(data)[1])
+X_s <- vector(mode="double", length = dim(data)[1])
+X_l <- vector(mode="double", length = dim(data)[1])
+# iterate over dataframe, add rgb values
+for (i in 1:nrow(data)) {
+    col_index <- data[i,]$scaled_pxl_col_in_lowres
+    row_index <- data[i,]$scaled_pxl_row_in_lowres
+    X_r[i] <- red[row_index, col_index]
+    X_g[i] <- green[row_index, col_index]
+    X_b[i] <- blue[row_index, col_index]
+    X_grey[i] <- grey[row_index, col_index]
+    X_h[i] <- hue[row_index, col_index]
+    X_s[i] <- saturation[row_index, col_index]
+    X_l[i] <- lightness[row_index, col_index]
+}
+data[ , ncol(data) + 1] <- X_r
+colnames(data)[ncol(data)] <- "red"
+data[ , ncol(data) + 1] <- X_g
+colnames(data)[ncol(data)] <- "green"
+data[ , ncol(data) + 1] <- X_b
+colnames(data)[ncol(data)] <- "blue"
+data[ , ncol(data) + 1] <- X_grey
+colnames(data)[ncol(data)] <- "grey"
+data[ , ncol(data) + 1] <- X_h
+colnames(data)[ncol(data)] <- "hue"
+data[ , ncol(data) + 1] <- X_s
+colnames(data)[ncol(data)] <- "saturation"
+data[ , ncol(data) + 1] <- X_l
+colnames(data)[ncol(data)] <- "lightness"
+```
+### Functions for high res images, where there exist more than one pixel per spot
+```{r, message=FALSE}
+# Mode Function
+getmode <- function(v) {
+   uniqv <- unique(v)
+   uniqv[which.max(tabulate(match(v, uniqv)))]
+}
+# For a spot 'i', plot histogram of pixel color values given by 'color'
+plot_pixels_per_spot <- function(ind_barcodes, data, color, i) {
+  # histogram of color channel values in one spot
+  barcode = rownames(ind_barcodes)[i]
+  x <- data[data$barcode == barcode, color]
+  hist(x, main=paste("Histogram of ",  color, " Color Channel Values For a Single Barcode", sep = ''),breaks=100)
+}
+# Create a df where rows are barcodes and columns are mean, median, and mode values for each RGB/Grayscale category
+populate_ind_barcodes <- function(i, ind_barcodes) {
+  colors <- c("red", "green", "blue", "grey")
+  for (j in 1:length(colors)) {
+    color <- colors[j]
+    vals <- data[data$barcode == row.names(ind_barcodes)[i],][,color]
+    ind_barcodes[i, paste("mean_", color, sep="")] <- mean(vals)
+    ind_barcodes[i, paste("median_", color, sep="")] <- median(vals)
+    ind_barcodes[i, paste("mode_", color, sep="")] <- getmode(vals)
+  }
+  ind_barcodes
+}
+# create empty df
+ind_barcodes <- data.frame(matrix(ncol = 12, nrow = length(unique(data[,'barcode']))))
+colnames(ind_barcodes) <- c('mean_red', 'median_red', 'mode_red', 'mean_green', 'median_green', 'mode_green', 'mean_blue', 'median_blue', 'mode_blue', 'mean_grey', 'median_grey', 'mode_grey')
+rownames(ind_barcodes) <- unique(data[, 'barcode'])
+# add mean, median, mode data
+for (i in 1:nrow(ind_barcodes)) {
+ ind_barcodes <- populate_ind_barcodes(i, ind_barcodes)
+}
+ind_barcodes
+```
+### Analysis 1: Compute Correlation First, Compute Outliers After
+In the following blocks, the correlation between all genes + pixel color
+channel values is calculated. Then, for the top performing genes, a
+plotting function is called in which spots corresponding to outlier gene
+expression values + spots corresponding to outlier pixel color channel
+values are removed.
+```{r, message=FALSE}
+# compute correlation matrix in parallel
+compute_corr_matrix <- function(expression_arr, color_arr, round=4) {
+  foreach(i = 1:ncol(expression_arr),
+  .combine = rbind,
+  .packages = c('data.table', 'doParallel')) %dopar% {
+    colName <- colnames(expression_arr)[i]
+    df <- data.frame(round(cor(expression_arr[,i], color_arr, method = 'pearson', use="complete.obs"), round))
+    rownames(df) <- colName
+    df
+  }
+}
+# Filter correlation matrix based on a threshold and replace gene IDs with gene names
+filter_corr_matrix <- function(correlation_matrix, threshold) {
+  filter <- as.data.frame(apply(abs(correlation_matrix) >= threshold, 1, any))
+  correlation_matrix_filtered <- correlation_matrix[filter[,1],]
+  correlation_matrix_filtered <- rename_gene_ids(correlation_matrix_filtered)
+  correlation_matrix_filtered
+}
+# Replace gene IDs with gene names
+rename_gene_ids <- function(correlation_matrix) {
+  for (i in 1:nrow(correlation_matrix)) {
+    g_name <- rownames(correlation_matrix)[i]
+    rownames(correlation_matrix)[i] <- rowData(spe_sub[g_name])$gene_name
+  }
+  correlation_matrix
+}
+# Plot correlation matrix
+plot_corr_matrix <- function(correlation_matrix) {
+  ggcorrplot(correlation_matrix, sig.level=0.01, lab_size = 4.5, p.mat = NULL,
+           insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1,
+           tl.cex = 7) +
+  theme(axis.text.x = element_text(margin=margin(-2,0,0,0)),
+        axis.text.y = element_text(margin=margin(0,-2,0,0)),
+        panel.grid.minor = element_line(size=7)) +
+  geom_tile(fill="white") +
+  geom_tile(height=1, width=1)
+}
+# function to calculate regression equation. Used as a sanity check
+lm_eqn <- function(vec1, vec2){
+    m <- lm(vec1 ~ vec2, data.frame(vec1, vec2));
+    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
+         list(a = format(unname(coef(m)[1]), digits = 2),
+              b = format(unname(coef(m)[2]), digits = 2),
+             r2 = format(summary(m)$r.squared, digits = 3)))
+    as.character(as.expression(eq));
+}
+```
+```{r, message=FALSE}
+color_arr <- data[,6:ncol(data)]
+correlation_matrix <- compute_corr_matrix(logcounts_151507_filtered, color_arr)
+correlation_matrix_filtered <- filter_corr_matrix(correlation_matrix, 0.3)
+plot_corr_matrix(correlation_matrix_filtered)
+```
+```{r, message=FALSE}
+# plot color channel value vs gene expression. If remove_outliers == TRUE, outlier gene expression values + outlier pixel color values are removed.
+# ASSUMES THAT EACH BARCODE HAS MULTIPLE PIXELS
+plot_channel_highres <- function(g_name, gene_id, color, stat, log_arr, color_arr, remove_outliers) {
+  stat_type <- paste(color, stat, sep="_")
+  x_axis <- log_arr[,gene_id]
+  y_axis <- color_arr[,stat_type] * 255
+  df <- data.frame(x_axis, y_axis)
+  if (remove_outliers) {
+    barcode_names <- rownames(log_arr)
+    overlap <- union(rownames(log_arr)[isOutlier(log_arr[,gene_id])], rownames(color_arr)[isOutlier(color_arr[,stat_type])])
+    rownames(df) <- barcode_names
+    df <- df[-which(rownames(df) %in% overlap),]
+    title <- paste(stat_type, " Channel Values vs ", g_name, " Gene Expression with Outliers Removed", sep="")
+  } else {
+    title <- paste(stat_type, " Channel Values vs ", g_name, " Gene Expression without Outliers Removed", sep="")
+  }
+  ggplot(df, aes(x=df[,1], y=df[,2])) + xlab(g_name) + ylab(stat_type) + geom_point(color=color) +
+  geom_smooth(method='lm', color='black') + stat_regline_equation(label.y = 190, aes(label = ..eq.label..)) +
+  stat_regline_equation(label.y = 180, aes(label = ..rr.label..)) + ggtitle(title)
+}
+# call plot_channel function for all color and mean/median/mode combinations
+# ASSUMES THAT EACH BARCODE HAS MULTIPLE PIXELS
+plot_all_channels_highres <- function(g_name, log_arr, color_arr, remove_outliers) {
+  gene_id <- rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
+  colors <- c('red', 'green', 'blue', 'grey')
+  stats <- c('mean', 'median', 'mode')
+  for (color in colors) {
+    for (stat in stats) {
+      print(plot_channel_highres(g_name, gene_id, color, stat, log_arr, color_arr, remove_outliers))
+    }
+  }
+}
+# plot color channel value vs gene expression. If remove_outliers == TRUE, outlier gene expression values + outlier pixel color values are removed.
+# ASSUMES THAT EACH BARCODE HAS ONE PIXEL
+plot_channel <- function(g_name, gene_id, color, log_arr, color_arr, remove_outliers) {
+  x_axis <- log_arr[,gene_id]
+  y_axis <- color_arr[,color] * 255
+  df <- data.frame(x_axis, y_axis)
+  if (remove_outliers) {
+    barcode_names <- rownames(log_arr)
+    overlap <- union(rownames(log_arr)[isOutlier(log_arr[,gene_id])], rownames(color_arr)[isOutlier(color_arr[,color])])
+    rownames(df) <- barcode_names
+    df <- df[-which(rownames(df) %in% overlap),]
+    title <- paste(color, " Channel Values vs ", g_name, " Gene Expression with Outliers Removed", sep="")
+  } else {
+    title <- paste(color, " Channel Values vs ", g_name, " Gene Expression without Outliers Removed", sep="")
+  }
+  ggplot(df, aes(x=df[,1], y=df[,2])) + xlab(g_name) + ylab(color) + geom_point(color=color) +
+  geom_smooth(method='lm', color='black') + stat_regline_equation(label.y = 190, aes(label = ..eq.label..)) +
+  stat_regline_equation(label.y = 180, aes(label = ..rr.label..)) + ggtitle(title)
+}
+# call plot_channel function for all colors
+# ASSUMES THAT EACH BARCODE HAS ONE PIXEL
+plot_all_channels <- function(g_name, log_arr, color_arr, remove_outliers) {
+  gene_id <- rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
+  colors <- c('red', 'green', 'blue', 'grey')
+  for (color in colors) {
+    print(plot_channel(g_name, gene_id, color, log_arr, color_arr, remove_outliers))
+  }
+}
+```
+```{r, message=FALSE}
+plot_all_channels('MT-ND1', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
+plot_all_channels('MT-ND1', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
+```
+```{r, message=FALSE}
+plot_all_channels('MT-CO2', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
+plot_all_channels('MT-CO2', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
+```
+```{r, message=FALSE}
+plot_all_channels('MT-ATP6', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
+plot_all_channels('MT-ATP6', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
+```
+```{r, message=FALSE}
+plot_all_channels('MT-CYB', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
+plot_all_channels('MT-CYB', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
+```
+## Redo Analysis Removing Outliers Before Correlation Computation
+In the following blocks, analysis is redone by removing outliers before
+correlation computation based on the isOutlier() function. If the number
+of non-NA expression values \< num = 100, the gene is removed from
+consideration. Then, correlation between the RGB/Grayscale values and
+gene expression values is computed.
+```{r, message=FALSE}
+# Remove outlier gene expression values. If median gene expression value is 0, replace complement of outliers with NA. Else replace outliers with NA. If remaining non-NA expression values < num, replace entire column with NA.
+remove_outliers <- function(arr, num) {
+  foreach(i = 1:ncol(arr),
+  .combine = cbind,
+  .packages = c('data.table', 'doParallel')) %dopar% {
+    if (median(arr[,i]) == 0) {
+      arr[,i][!isOutlier(arr[,i])] <- NA
+    } else {
+      arr[,i][isOutlier(arr[,i])] <- NA
+    }
+    if (matrixStats::count(!is.na(arr[,i])) < num) {
+      arr[,i] <- rep(NA, nrow(arr))
+    }
+    arr[,i]
+  }
+}
+get_top_k_corr <- function(mat, k) {
+  temp_ind <- c()
+  for (i in 1:ncol(mat)) {
+    temp_ind <- c(temp_ind, order(abs(mat[,i]), decreasing=T)[1:k])
+  }
+  top_corr_ind <- unique(temp_ind)
+  top_corr <- rename_gene_ids(data.frame(mat[top_corr_ind,]))
+  top_corr
+}
+get_gene_id <- function(g_name, spe) {
+  gene_id = rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
+  gene_id
+}
+add_gene_col_data_df <- function(gene_id, g_name, data) {
+  if (!(g_name %in% colnames(data))) {
+    data[,ncol(data) + 1] <- data.frame(logcounts_151673_filtered[,gene_id])
+    colnames(data)[ncol(data)] <- g_name
+  }
+  data
+}
+drop_cols_data_df <- function(drop, data) {
+  data <- data[,!(names(data) %in% drop)]
+  data
+}
+range01 <- function(x){(x-min(x))/(max(x)-min(x))}
+```
+```{r}
+g_name <- 'MOBP'
+gene_id <- get_gene_id(g_name, spe)
+data <- add_gene_col_data_df(gene_id, g_name, data)
+```
+```{r}
+#temp <- data.frame(spatialCoords(spe_sub), vec1)
+g_name='vec1'
+temp <- data.frame(spatialCoords(spe_sub), vec1)
+print(ggplot(temp, aes(x = temp[,'pxl_row_in_lowres'], y=600 - temp[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
+```
+```{r}
+for (g_name in colnames(data[6:12])) {
+  print(ggplot(data, aes(x = data[,'scaled_pxl_row_in_lowres'], y=600 - data[,'scaled_pxl_col_in_lowres'])) + geom_point(aes(color=data[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
+}
+```
+```{r}
+top_corr
+```
+```{r}
+mat <- compute_corr_matrix(logcounts_151673_filtered, data[6:12])
+top_corr <- get_top_k_corr(mat, 10)
+plot_corr_matrix(data.frame(top_corr))
+```
+```{r}
+for (g_name in rownames(top_corr)) {
+  gene_id <- get_gene_id(g_name, spe)
+  data <- add_gene_col_data_df(gene_id, g_name, data)
+}
+data
+```
+```{r}
+for (g_name in colnames(data[13:ncol(data)])) {
+  print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
+  print(ggplot(data, aes(x = data[,'scaled_pxl_row_in_lowres'], y=600 - data[,'scaled_pxl_col_in_lowres'])) + geom_point(aes(color=data[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
+}
+```
+```{r}
+logcounts_151673[,i] > 0
+sum(logcounts_151673_filtered[,gene_id] > 0)
+sum(!is.na(logcounts_151673_filtered[,gene_id]))
+compute_corr_matrix(data.frame(logcounts_151673_filtered), data[,6:12])
+compute_corr_matrix(data.frame(logcounts_151673_filtered_no_outliers[,gene_id]), temp_data[,6:12])
+```
+```{r}
+g_name = 'SNAP25'
+gene_id = rowData(spe)[rowData(spe)$gene_name == 'SNAP25',]$gene_id
+print(sum(logcounts_151507[,gene_id] != 0))
+print(sum(logcounts_151507_filtered[,gene_id] != 0))
+print(sum(!is.na(logcounts_151507_filtered_no_outliers[,gene_id])))
+mat_with_outliers <- compute_corr_matrix(data.frame(logcounts_151507_filtered[,gene_id]), data[,4:5])
+mat_with_outliers
+mat_outliers <- compute_corr_matrix(data.frame(logcounts_151507_filtered_no_outliers[,gene_id]), data[,4:5])
+mat_outliers
+```
+```{r, message=FALSE}
+# Call remove outlier function
+logcounts_temp <- as.matrix(logcounts_151673_filtered)
+col_names <- colnames(logcounts_temp)
+logcounts_151673_filtered_no_outliers <- remove_outliers(logcounts_temp, 100)
+colnames(logcounts_151673_filtered_no_outliers) <- col_names
+# Remove columns that are completely NA
+not_all_na <- function(x) any(!is.na(x))
+logcounts_151673_filtered_no_outliers <- data.frame(logcounts_151673_filtered_no_outliers) %>% select(where(not_all_na))
+# compute correlation matrix
+mat <- compute_corr_matrix(logcounts_151673_filtered_no_outliers, data[6:12])
+```
+```{r, message=FALSE}
+top_corr <- get_top_k_corr(mat, 10)
+plot_corr_matrix(top_corr)
+```
+```{r}
+# MOBP, SNAP25, plot
+# Linear regression model
+# Where do MOBP/SNAP25 fall in my analysis
+# When did it get removed, spot plots of these genes
+# spot plots of top genes in correlation analysis
+gene_id = rowData(spe)[rowData(spe)$gene_name == 'LFNG',]$gene_id
+sum(!is.na(logcounts_151507_filtered_no_outliers[,gene_id]))
+```
+```{r, message=FALSE}
+plot_all_channels('LFNG', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
+```
+```{r, message=FALSE}
+plot_all_channels('ELK4', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
+```
+```{r, message=FALSE}
+plot_all_channels('PLIN4', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
+```