st-histology-ml / Git / [c9fae8] /EDA.Rmd

Models:
DanielG/
st-histology-ml
Downloads: 1
[c9fae8]: / EDA.Rmd
History
Download this file
869 lines (718 with data), 32.5 kB

---
title: "R Notebook"
output: html_notebook
editor_options: 
  markdown: 
    wrap: 72
---

# Exploratory Data Analysis and Feature Extraction

In this markdown, exploratory data analysis is performed on sample
151507, with the goal of discovering genes whose expression values are
correlated with simple histology image features (RGB/Grayscale color
space). Sample 151507 includes 4136 spots and 21131 genes after
performing QC on spots and removing genes with an expression level of 0
for all spots.

First, RGB/grayscale color channels are extracted from the image. All
images are in lowres, meaning that they each have size 600 x 600. Then,
the 4136 x 9 dataframe 'data' is created. Each row of data represents a
barcode/spot, and the following column data is represented:

"pxl_col_in_fullres": column number of barcode in fullres image

"pxl_row_in_fullres": row number of barcode in fullres image

"barcode": string containing barcode

"scaled_pxl_col_in_lowres": column number of barcode in lowres image

"scaled_pxl_row_in_lowres": row number of barcode in lowres image

"red": red channel value of barcode

"green": green channel value of barcode

"blue": blue channel value of barcode

"grey": grey channel value of barcode

Then, correlation + the removal of outliers is performed in two ways,
with the second being optimal. 1. Correlation between gene expression
values and RGB/Grayscale pixel values is performed; then, outlier pixel
and gene expression values are removed. 2. Outliers are removed first,
then correlation is performed.

## Imports

```{r Import, message=FALSE}
library(spatialLIBD)
library(scuttle)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(EBImage)
library(ggpubr)
library(foreach)
library(doParallel)
library(plotwidgets)
library(Dict)
library(lme4)
registerDoParallel(4)
library(reticulate)

py_install("tensorflow")
py_install("keras")

layers <- import("tensorflow.keras.layers")
models <- import("tensorflow.keras.models")
optimizers <- import("tensorflow.keras.optimizers")
```

## Finding Features of Interest From Color Spaces

The first features we want to extract from the sample are from the
RGBGrayscale + HSL color spaces. We can first scale the pxl values in
the SpatialExperiment object to fit the lowres image, and then determine
the feature values for each pixel and add them to the SpatialExperiment
object.

```{r Helper}
### Helper Functions for Finding Features of Interest

get_img <- function(sample_id) {
  png_link <- paste('https://spatial-dlpfc.s3.us-east-2.amazonaws.com/images/',sample_id,'_tissue_lowres_image.png',sep='')
  img <- readImage(png_link)
  img
}

get_rgbg <- function(img) {
  # create 600x600 color channel matrices for each channel
  red <- imageData(EBImage::channel(img, 'red'))
  green <- imageData(EBImage::channel(img, 'green'))
  blue <- imageData(EBImage::channel(img, 'blue'))
  grey <- imageData(EBImage::channel(img, 'grey'))
  list('red'=red, 'green'=green, 'blue'=blue, 'grey'=grey)
}

visualize_color_channels <- function(rgbg) {
  hist(rgbg$red)
  hist(rgbg$green)
  hist(rgbg$blue)
  hist(rgbg$grey)
}

get_hsl <- function(rgbg) {
  rgb_mat <- t(data.frame(c(rgbg$red), c(rgbg$green), c(rgbg$blue)))
  hsl_mat <- rgb2hsl(rgb_mat)
  
  # HSL color space
  hue <- matrix(hsl_mat[1,], nrow=600, ncol=600)
  saturation <- matrix(hsl_mat[2,], nrow=600, ncol=600)
  lightness <- matrix(hsl_mat[3,], nrow=600, ncol=600)
  
  list("hue"=hue, "saturation"=saturation, "lightness"=lightness)
}

add_scaled_pxls_lowres <- function(spe_sub) {
  scaling <- SpatialExperiment::scaleFactors(spe_sub)
  pxl_col_in_lowres <-round(spatialCoords(spe_sub)[,'pxl_col_in_fullres'] * scaling)
  pxl_row_in_lowres <-round(spatialCoords(spe_sub)[,'pxl_row_in_fullres'] * scaling)
  spatialCoords(spe_sub) <- cbind(spatialCoords(spe_sub), pxl_col_in_lowres, pxl_row_in_lowres)
  spe_sub
}

match_all_color_spaces <- function(color, pxl_indices) {
  apply(pxl_indices, MARGIN=1, match_each_color_space, color=color)
}

match_each_color_space <- function(pxl_indices, color) {
  color[pxl_indices[1], pxl_indices[2]]
}

add_color_features <- function(spe_sub, colors) {
  pxl_indices <- spatialCoords(spe_sub)[,c('pxl_row_in_lowres','pxl_col_in_lowres')]
  feature_df <- lapply(c(rgbg, hsl), match_all_color_spaces, pxl_indices=pxl_indices)
  spatialCoords(spe_sub) <- cbind(spatialCoords(spe_sub), 
                                feature_df$red,
                                feature_df$green,
                                feature_df$blue,
                                feature_df$grey,
                                feature_df$hue,
                                feature_df$saturation,
                                feature_df$lightness)
  colnames(spatialCoords(spe_sub)) <- c('pxl_col_in_fullres',
                                        'pxl_row_in_fullres',
                                        'pxl_col_in_lowres',
                                        'pxl_row_in_lowres',
                                        'red', 'green', 'blue', 'grey', 
                                        'hue', 'saturation', 'lightness')
  spe_sub
}
```

```{r}
# Get image and RGBG/HSL feature spaces
img <- get_img("151673")
rgbg <- get_rgbg(img)
hsl <- get_hsl(rgbg)

# Add low resolution pixel values to SpatialExperiment object
spe_sub <- add_scaled_pxls_lowres(spe_sub)

# Match color space values to pixel row/column location, and add to SpatialExperiment object
spe_sub <- add_color_features(spe_sub, c(rgbg, hsl))
```

## Finding Features of Interest From Autoencoder Latent Space

Another approach to finding features of interest is by using an encoder
model to capture the latent space between an encoder/decoder autoencoder
model. Then, we can use these low-dimensional features to perform
classification.

To build this model, we first need to split out input image into image
'patches' centered at each spot. The size of these patches was chosen to
be 14x14 pixels. Then, for each spot, the corresponding image patch will
be encoded into a feature vector.

To simplify this process, a 'Spot' class has been created to store the
current pixel values of the spot and features from the latent space.

```{r}
create_spot_instance <- function(rowname, img, spe_sub) {
  coords = spatialCoords(spe_sub)[rowname,]
  row = coords['pxl_row_in_lowres']
  col = coords['pxl_col_in_lowres']
  
  patch = imageData(img[(row - 6): (row + 6), (col - 6): (col + 6),])
  
  new("Spot", 
      sampleId=imgData(spe_sub)$sample_id, 
      barcode=rowname,
      imagePatch=patch,
      spatialCoords=coords)
}

setClass("Spot", 
         slots=list(sampleId="character", 
                    barcode="character", 
                    imagePatch="array", 
                    spatialCoords="numeric", 
                    latentSpace="array"))

vec <- lapply(rownames(spatialCoords(spe_sub)), create_spot_instance, img=img, spe_sub=spe_sub)
```

```{r}
encoder_input = layers$Input(shape=as.integer(507), name='encoder_input')
encoder_dense_layer1 = layers$Dense(units=as.integer(300), name='encoder_dense_1')(encoder_input)
encoder_activ_layer1 = layers$LeakyReLU(name="encoder_leakyrelu_1")(encoder_dense_layer1)
encoder_dense_layer2 = layers$Dense(units=as.integer(2), name='encoder_dense_2')(encoder_activ_layer1)
encoder_output = layers$LeakyReLU(name='encoder_leakyrelu_2')(encoder_dense_layer2)
encoder = models$Model(encoder_input, encoder_output, name="encoder_model")

decoder_input = layers$Input(shape=as.integer(2), name='decoder_input')
decoder_dense_layer1 = layers$Dense(units=as.integer(300), name='decoder_dense_1')(decoder_input)
decoder_activ_layer1 = layers$LeakyReLU(name="decoder_leakyrelu_1")(decoder_dense_layer1)
decoder_dense_layer2 = layers$Dense(units=as.integer(507), name='decoder_dense_2')(decoder_activ_layer1)
decoder_output = layers$LeakyReLU(name='decoder_leakyrelu_2')(decoder_dense_layer2)
decoder = models$Model(decoder_input, decoder_output, name="decoder_model")

ae_input = layers$Input(shape=as.integer(507), name="AE_input")
ae_encoder_output = encoder(ae_input)
ae_decoder_output = decoder(ae_encoder_output)
```

```{r}
get_encoded_features <- function(spot) {
  temp <- c(spot@imagePatch)
  dim(temp) <- c(1, length(temp))
  
  ae = models$Model(ae_input, ae_decoder_output, name="AE")
  ae$compile(loss="mse", optimizer=optimizers$Adam(learning_rate=0.0005))
  ae$fit(temp, temp, epochs=as.integer(50), shuffle=TRUE, verbose=as.integer(0))
  
  encoding = encoder$predict(temp, verbose=as.integer(0))
  spot@latentSpace <- encoding
  spot
}

extract_encodings <- function(spot) {
  c(spot@latentSpace, spot@spatialCoords)
}

extract_barcodes <- function(spot) {
  spot@barcode
}

add_gene_col <- function(g_name, features, logcounts, spe_sub) {
  if (!(g_name %in% colnames(features))) {
    features[,ncol(features) + 1] <- data.frame(logcounts[,get_gene_id(g_name, spe_sub)])
    colnames(features)[ncol(features)] <- g_name
  }
  features
}

temp_vec_1 <- lapply(vec[1:500], get_encoded_features)
temp_vec_2 <- lapply(vec[501:1000], get_encoded_features)
temp_vec_3 <- lapply(vec[1001:1500], get_encoded_features)
temp_vec_4 <- lapply(vec[1501:2000], get_encoded_features)
temp_vec_5 <- lapply(vec[2001:2500], get_encoded_features)
temp_vec_6 <- lapply(vec[2501:3000], get_encoded_features)
temp_vec_7 <- lapply(vec[3001:3590], get_encoded_features)

vec <- c(temp_vec_1, temp_vec_2, temp_vec_3, temp_vec_4, temp_vec_5, temp_vec_6, temp_vec_7)
features <- t(sapply(vec, extract_encodings))
rownames(features) <- sapply(vec, extract_barcodes)
colnames(features)[1:2] <- c('latent_space_1', 'latent_space_2')

features <- add_gene_col('MOBP', features, logcounts, spe_sub)
features <- add_gene_col('SNAP25', features, logcounts, spe_sub)
features <- add_gene_col('PCP4', features, logcounts, spe_sub)
```

```{r}
# plot MOBP, SNAP25, PCP4 in 2D feature space
print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'MOBP'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("MOBP Expression Level in 2D Latent Space")))

print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'SNAP25'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("SNAP25 Expression Level in 2D Latent Space")))

print(ggplot(features, aes(x=features[,'latent_space_1'], y = features[,'latent_space_2'])) + geom_point(aes(color=features[,'PCP4'])) + theme(aspect.ratio = 1) + scale_color_gradient(low='pink', high='purple') + ggtitle(paste("PCP4 Expression Level in 2D Latent Space")))
```

```{r}
# Plot two feature latent space
features <- data.frame(features)
# Spot Plots for Autoencoder Features
print(ggplot(features, aes(x = features[,'pxl_row_in_lowres'], y=600 - features[,'pxl_col_in_lowres'])) + geom_point(aes(color=features[,'latent_space_1'])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",'latent_space_1')))

print(ggplot(features, aes(x = features[,'pxl_row_in_lowres'], y=600 - features[,'pxl_col_in_lowres'])) + geom_point(aes(color=features[,'latent_space_2'])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",'latent_space_2')))
```

```{r}
mat <- compute_corr_matrix(logcounts, features[,1:13])

top_corr <- get_top_k_corr(mat, 10)

plot_corr_matrix(data.frame(top_corr))
```

## Finding Genes of Interest

An overall goal of this exploratory data analysis is to identify genes
of interest so that we can build predictive models in the future. To
accomplish this, a correlation matrix between spots and image features
(color spaces) is created. Then, the genes exhibiting the highest
correlation to image features are selected and plotted.

```{r}
get_gene_name <- function(x, spe_sub) {
  rowData(spe_sub)[rowData(spe_sub)$gene_id == x,]$gene_name
}

get_gene_id <- function(x, spe_sub) {
  rowData(spe_sub)[rowData(spe_sub)$gene_name == x ,]$gene_id
}

create_id_name_dict <- function(logcounts, spe_sub) {
  id_to_name <- mclapply(colnames(logcounts), get_gene_name, spe_sub=spe_sub)
  names(id_to_name) <- colnames(logcounts)
  id_to_name
}

# compute correlation matrix in parallel
compute_corr_matrix <- function(expression_arr, color_arr, round=4) {
  foreach(i = 1:ncol(expression_arr),
  .combine = rbind,
  .packages = c('data.table', 'doParallel')) %dopar% {
    colName <- colnames(expression_arr)[i]
    df <- data.frame(round(cor(expression_arr[,i], color_arr, method = 'pearson', use="complete.obs"), round))
    rownames(df) <- colName
    df
  }
}

get_top_k_corr <- function(mat, k) {
  temp_ind <- c()
  for (i in 1:ncol(mat)) {
    temp_ind <- c(temp_ind, order(abs(mat[,i]), decreasing=T)[1:k])
  }
  top_corr_ind <- unique(temp_ind)
  # temp_ind <- c(temp_ind, order(abs(as.numeric(unlist(mat))), decreasing=T)[1:k])
  # temp_ind <- temp_ind %% 600
  top_corr <- rename_gene_ids(data.frame(mat[top_corr_ind,]))
  top_corr
}

# Replace gene IDs with gene names
rename_gene_ids <- function(correlation_matrix) {
  for (i in 1:nrow(correlation_matrix)) {
    g_name <- rownames(correlation_matrix)[i]
    rownames(correlation_matrix)[i] <- rowData(spe_sub[g_name])$gene_name
  }
  correlation_matrix
}

# Plot correlation matrix
plot_corr_matrix <- function(correlation_matrix) {
  ggcorrplot(correlation_matrix, sig.level=0.01, lab_size = 4.5, p.mat = NULL,
           insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1,
           tl.cex = 7) +
  theme(axis.text.x = element_text(margin=margin(-2,0,0,0)),
        axis.text.y = element_text(margin=margin(0,-2,0,0)),
        panel.grid.minor = element_line(size=7)) + 
  geom_tile(fill="white") +
  geom_tile(height=1, width=1)
}

get_gene_id <- function(g_name, spe) {
  gene_id = rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
  gene_id
}
```

```{r}
gene_id <- get_gene_id('TERB1', spe_sub)
mat[gene_id,]

```

```{r}
logcounts <- t(logcounts(spe_sub))

id_to_name <- create_id_name_dict(logcounts, spe_sub)

mat <- compute_corr_matrix(logcounts, 
                           spatialCoords(spe_sub)[,5:ncol(spatialCoords(spe_sub))])

top_corr <- get_top_k_corr(mat, 20)

plot_corr_matrix(data.frame(top_corr))
```

```{r}
mat[get_gene_id('SNAP25', spe_sub),]
top_corr
```

## Spot Plots of Features

```{r}
temp_df <- data.frame(spatialCoords(spe_sub))
for (g_name in colnames(temp_df[,5:11])) {
  # print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
  print(ggplot(temp_df, aes(x = temp_df[,'pxl_row_in_lowres'], y=600 - temp_df[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp_df[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
}
```

```{r}
# Add genes
```

```{r}
temp_df <- data.frame(spatialCoords(spe_sub))
for (g_name in colnames(temp_df[,5:11])) {
  # print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
  print(ggplot(temp_df, aes(x = temp_df[,'pxl_row_in_lowres'], y=600 - temp_df[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp_df[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
}
```

### Create RGB and HSL color channels and plot histograms of distribution

```{r, message=FALSE}
png_link <- paste('https://spatial-dlpfc.s3.us-east-2.amazonaws.com/images/',sample_id,'_tissue_lowres_image.png',sep='')
img <- readImage(png_link)

# create 600x600 color channel matrices for each channel
red <- imageData(EBImage::channel(img, 'red'))
green <- imageData(EBImage::channel(img, 'green'))
blue <- imageData(EBImage::channel(img, 'blue'))
grey <- imageData(EBImage::channel(img, 'grey'))

# visualize histograms of each color channel
hist(red)
hist(green)
hist(blue)
hist(grey)

rgb_mat <- t(data.frame(c(red), c(green), c(blue)))
hsl_mat <- rgb2hsl(rgb_mat)

# HSL color space
hue <- Matrix(hsl_mat[1,], nrow=600, ncol=600)
saturation <- Matrix(hsl_mat[2,], nrow=600, ncol=600)
lightness <- Matrix(hsl_mat[3,], nrow=600, ncol=600)

# get scale factor
scaling <- SpatialExperiment::scaleFactors(spe_sub)

# store spatial coords information (pxl col and row numbers in fullres image for each barcode)
spatial_coords <- spatialCoords(spe_sub)

# create data frame to hold pixel locations and pixel color channel values
data <- data.frame(spatial_coords, row.names=NULL)
data['barcode'] <- rownames(spatial_coords)

# add scaled pixel columns
new <- ceiling(data[["pxl_col_in_fullres"]] * scaling)
data[ , ncol(data) + 1] <- new 
colnames(data)[ncol(data)] <- "scaled_pxl_col_in_lowres"

# add scaled pixel rows
new <- ceiling(data[["pxl_row_in_fullres"]] * scaling)
data[ , ncol(data) + 1] <- new 
colnames(data)[ncol(data)] <- "scaled_pxl_row_in_lowres"

# create vectors to be added to dataframe
X_r <- vector(mode="double", length = dim(data)[1])
X_g <- vector(mode="double", length = dim(data)[1])
X_b <- vector(mode="double", length = dim(data)[1])
X_grey <- vector(mode="double", length = dim(data)[1])
X_h <- vector(mode="double", length = dim(data)[1])
X_s <- vector(mode="double", length = dim(data)[1])
X_l <- vector(mode="double", length = dim(data)[1])

# iterate over dataframe, add rgb values
for (i in 1:nrow(data)) {
    col_index <- data[i,]$scaled_pxl_col_in_lowres
    row_index <- data[i,]$scaled_pxl_row_in_lowres
    X_r[i] <- red[row_index, col_index]
    X_g[i] <- green[row_index, col_index]
    X_b[i] <- blue[row_index, col_index]
    X_grey[i] <- grey[row_index, col_index]
    X_h[i] <- hue[row_index, col_index]
    X_s[i] <- saturation[row_index, col_index]
    X_l[i] <- lightness[row_index, col_index]
}

data[ , ncol(data) + 1] <- X_r
colnames(data)[ncol(data)] <- "red"
data[ , ncol(data) + 1] <- X_g
colnames(data)[ncol(data)] <- "green"
data[ , ncol(data) + 1] <- X_b
colnames(data)[ncol(data)] <- "blue"
data[ , ncol(data) + 1] <- X_grey
colnames(data)[ncol(data)] <- "grey"
data[ , ncol(data) + 1] <- X_h
colnames(data)[ncol(data)] <- "hue"
data[ , ncol(data) + 1] <- X_s
colnames(data)[ncol(data)] <- "saturation"
data[ , ncol(data) + 1] <- X_l
colnames(data)[ncol(data)] <- "lightness"
```

### Functions for high res images, where there exist more than one pixel per spot

```{r, message=FALSE}
# Mode Function
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# For a spot 'i', plot histogram of pixel color values given by 'color'
plot_pixels_per_spot <- function(ind_barcodes, data, color, i) {
  # histogram of color channel values in one spot
  barcode = rownames(ind_barcodes)[i]
  x <- data[data$barcode == barcode, color]
  hist(x, main=paste("Histogram of ",  color, " Color Channel Values For a Single Barcode", sep = ''),breaks=100)
}

# Create a df where rows are barcodes and columns are mean, median, and mode values for each RGB/Grayscale category
populate_ind_barcodes <- function(i, ind_barcodes) {
  colors <- c("red", "green", "blue", "grey")
  for (j in 1:length(colors)) {
    color <- colors[j]
    vals <- data[data$barcode == row.names(ind_barcodes)[i],][,color]
    ind_barcodes[i, paste("mean_", color, sep="")] <- mean(vals)
    ind_barcodes[i, paste("median_", color, sep="")] <- median(vals)
    ind_barcodes[i, paste("mode_", color, sep="")] <- getmode(vals) 
  }
  ind_barcodes
}

# create empty df
ind_barcodes <- data.frame(matrix(ncol = 12, nrow = length(unique(data[,'barcode']))))
colnames(ind_barcodes) <- c('mean_red', 'median_red', 'mode_red', 'mean_green', 'median_green', 'mode_green', 'mean_blue', 'median_blue', 'mode_blue', 'mean_grey', 'median_grey', 'mode_grey')
rownames(ind_barcodes) <- unique(data[, 'barcode'])

# add mean, median, mode data
for (i in 1:nrow(ind_barcodes)) {
 ind_barcodes <- populate_ind_barcodes(i, ind_barcodes)
}
ind_barcodes
```

### Analysis 1: Compute Correlation First, Compute Outliers After

In the following blocks, the correlation between all genes + pixel color
channel values is calculated. Then, for the top performing genes, a
plotting function is called in which spots corresponding to outlier gene
expression values + spots corresponding to outlier pixel color channel
values are removed.

```{r, message=FALSE}
# compute correlation matrix in parallel
compute_corr_matrix <- function(expression_arr, color_arr, round=4) {
  foreach(i = 1:ncol(expression_arr),
  .combine = rbind,
  .packages = c('data.table', 'doParallel')) %dopar% {
    colName <- colnames(expression_arr)[i]
    df <- data.frame(round(cor(expression_arr[,i], color_arr, method = 'pearson', use="complete.obs"), round))
    rownames(df) <- colName
    df
  }
}

# Filter correlation matrix based on a threshold and replace gene IDs with gene names
filter_corr_matrix <- function(correlation_matrix, threshold) {
  filter <- as.data.frame(apply(abs(correlation_matrix) >= threshold, 1, any))
  correlation_matrix_filtered <- correlation_matrix[filter[,1],]
  correlation_matrix_filtered <- rename_gene_ids(correlation_matrix_filtered)
  correlation_matrix_filtered
}

# Replace gene IDs with gene names
rename_gene_ids <- function(correlation_matrix) {
  for (i in 1:nrow(correlation_matrix)) {
    g_name <- rownames(correlation_matrix)[i]
    rownames(correlation_matrix)[i] <- rowData(spe_sub[g_name])$gene_name
  }
  correlation_matrix
}

# Plot correlation matrix
plot_corr_matrix <- function(correlation_matrix) {
  ggcorrplot(correlation_matrix, sig.level=0.01, lab_size = 4.5, p.mat = NULL,
           insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1,
           tl.cex = 7) +
  theme(axis.text.x = element_text(margin=margin(-2,0,0,0)),
        axis.text.y = element_text(margin=margin(0,-2,0,0)),
        panel.grid.minor = element_line(size=7)) + 
  geom_tile(fill="white") +
  geom_tile(height=1, width=1)
}

# function to calculate regression equation. Used as a sanity check
lm_eqn <- function(vec1, vec2){
    m <- lm(vec1 ~ vec2, data.frame(vec1, vec2));
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
         list(a = format(unname(coef(m)[1]), digits = 2),
              b = format(unname(coef(m)[2]), digits = 2),
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));
}
```

```{r, message=FALSE}
color_arr <- data[,6:ncol(data)]
correlation_matrix <- compute_corr_matrix(logcounts_151507_filtered, color_arr)
correlation_matrix_filtered <- filter_corr_matrix(correlation_matrix, 0.3)
plot_corr_matrix(correlation_matrix_filtered)
```

```{r, message=FALSE}
# plot color channel value vs gene expression. If remove_outliers == TRUE, outlier gene expression values + outlier pixel color values are removed.
# ASSUMES THAT EACH BARCODE HAS MULTIPLE PIXELS
plot_channel_highres <- function(g_name, gene_id, color, stat, log_arr, color_arr, remove_outliers) {
  stat_type <- paste(color, stat, sep="_")
  x_axis <- log_arr[,gene_id]
  y_axis <- color_arr[,stat_type] * 255
  df <- data.frame(x_axis, y_axis)
  if (remove_outliers) {
    barcode_names <- rownames(log_arr)
    overlap <- union(rownames(log_arr)[isOutlier(log_arr[,gene_id])], rownames(color_arr)[isOutlier(color_arr[,stat_type])])
    rownames(df) <- barcode_names
    df <- df[-which(rownames(df) %in% overlap),]
    title <- paste(stat_type, " Channel Values vs ", g_name, " Gene Expression with Outliers Removed", sep="")
  } else {
    title <- paste(stat_type, " Channel Values vs ", g_name, " Gene Expression without Outliers Removed", sep="")
  }
  ggplot(df, aes(x=df[,1], y=df[,2])) + xlab(g_name) + ylab(stat_type) + geom_point(color=color) +
  geom_smooth(method='lm', color='black') + stat_regline_equation(label.y = 190, aes(label = ..eq.label..)) +
  stat_regline_equation(label.y = 180, aes(label = ..rr.label..)) + ggtitle(title) 
}

# call plot_channel function for all color and mean/median/mode combinations
# ASSUMES THAT EACH BARCODE HAS MULTIPLE PIXELS
plot_all_channels_highres <- function(g_name, log_arr, color_arr, remove_outliers) {
  gene_id <- rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
  colors <- c('red', 'green', 'blue', 'grey')
  stats <- c('mean', 'median', 'mode')

  for (color in colors) {
    for (stat in stats) {
      print(plot_channel_highres(g_name, gene_id, color, stat, log_arr, color_arr, remove_outliers)) 
    }
  }
}

# plot color channel value vs gene expression. If remove_outliers == TRUE, outlier gene expression values + outlier pixel color values are removed.
# ASSUMES THAT EACH BARCODE HAS ONE PIXEL
plot_channel <- function(g_name, gene_id, color, log_arr, color_arr, remove_outliers) {
  x_axis <- log_arr[,gene_id]
  y_axis <- color_arr[,color] * 255
  df <- data.frame(x_axis, y_axis)
  if (remove_outliers) {
    barcode_names <- rownames(log_arr)
    overlap <- union(rownames(log_arr)[isOutlier(log_arr[,gene_id])], rownames(color_arr)[isOutlier(color_arr[,color])])
    rownames(df) <- barcode_names
    df <- df[-which(rownames(df) %in% overlap),]
    title <- paste(color, " Channel Values vs ", g_name, " Gene Expression with Outliers Removed", sep="")
  } else {
    title <- paste(color, " Channel Values vs ", g_name, " Gene Expression without Outliers Removed", sep="")
  }
  ggplot(df, aes(x=df[,1], y=df[,2])) + xlab(g_name) + ylab(color) + geom_point(color=color) +
  geom_smooth(method='lm', color='black') + stat_regline_equation(label.y = 190, aes(label = ..eq.label..)) +
  stat_regline_equation(label.y = 180, aes(label = ..rr.label..)) + ggtitle(title) 
}

# call plot_channel function for all colors
# ASSUMES THAT EACH BARCODE HAS ONE PIXEL
plot_all_channels <- function(g_name, log_arr, color_arr, remove_outliers) {
  gene_id <- rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
  colors <- c('red', 'green', 'blue', 'grey')

  for (color in colors) {
    print(plot_channel(g_name, gene_id, color, log_arr, color_arr, remove_outliers))
  }
}
```

```{r, message=FALSE}
plot_all_channels('MT-ND1', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
plot_all_channels('MT-ND1', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
```

```{r, message=FALSE}
plot_all_channels('MT-CO2', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
plot_all_channels('MT-CO2', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
```

```{r, message=FALSE}
plot_all_channels('MT-ATP6', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
plot_all_channels('MT-ATP6', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
```

```{r, message=FALSE}
plot_all_channels('MT-CYB', logcounts_151507_filtered, color_arr, remove_outliers=FALSE)
plot_all_channels('MT-CYB', logcounts_151507_filtered, color_arr, remove_outliers=TRUE)
```

## Redo Analysis Removing Outliers Before Correlation Computation

In the following blocks, analysis is redone by removing outliers before
correlation computation based on the isOutlier() function. If the number
of non-NA expression values \< num = 100, the gene is removed from
consideration. Then, correlation between the RGB/Grayscale values and
gene expression values is computed.

```{r, message=FALSE}
# Remove outlier gene expression values. If median gene expression value is 0, replace complement of outliers with NA. Else replace outliers with NA. If remaining non-NA expression values < num, replace entire column with NA.
remove_outliers <- function(arr, num) {
  foreach(i = 1:ncol(arr),
  .combine = cbind,
  .packages = c('data.table', 'doParallel')) %dopar% {
    if (median(arr[,i]) == 0) {
      arr[,i][!isOutlier(arr[,i])] <- NA
    } else {
      arr[,i][isOutlier(arr[,i])] <- NA
    }
    if (matrixStats::count(!is.na(arr[,i])) < num) {
      arr[,i] <- rep(NA, nrow(arr))
    }
    arr[,i]
  }
}

get_top_k_corr <- function(mat, k) {
  temp_ind <- c()
  for (i in 1:ncol(mat)) {
    temp_ind <- c(temp_ind, order(abs(mat[,i]), decreasing=T)[1:k])
  }
  top_corr_ind <- unique(temp_ind)
  top_corr <- rename_gene_ids(data.frame(mat[top_corr_ind,]))
  top_corr
}

get_gene_id <- function(g_name, spe) {
  gene_id = rowData(spe)[rowData(spe)$gene_name == g_name,]$gene_id
  gene_id
}

add_gene_col_data_df <- function(gene_id, g_name, data) {
  if (!(g_name %in% colnames(data))) {
    data[,ncol(data) + 1] <- data.frame(logcounts_151673_filtered[,gene_id])
    colnames(data)[ncol(data)] <- g_name
  }
  data
}

drop_cols_data_df <- function(drop, data) {
  data <- data[,!(names(data) %in% drop)]
  data
}

range01 <- function(x){(x-min(x))/(max(x)-min(x))}
```

```{r}
g_name <- 'MOBP'
gene_id <- get_gene_id(g_name, spe)
data <- add_gene_col_data_df(gene_id, g_name, data)
```

```{r}
#temp <- data.frame(spatialCoords(spe_sub), vec1)
g_name='vec1'
temp <- data.frame(spatialCoords(spe_sub), vec1)
print(ggplot(temp, aes(x = temp[,'pxl_row_in_lowres'], y=600 - temp[,'pxl_col_in_lowres'])) + geom_point(aes(color=temp[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
```

```{r}
for (g_name in colnames(data[6:12])) {
  print(ggplot(data, aes(x = data[,'scaled_pxl_row_in_lowres'], y=600 - data[,'scaled_pxl_col_in_lowres'])) + geom_point(aes(color=data[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
}
```

```{r}
top_corr
```

```{r}
mat <- compute_corr_matrix(logcounts_151673_filtered, data[6:12])
top_corr <- get_top_k_corr(mat, 10)
plot_corr_matrix(data.frame(top_corr))
```

```{r}
for (g_name in rownames(top_corr)) {
  gene_id <- get_gene_id(g_name, spe)
  data <- add_gene_col_data_df(gene_id, g_name, data)
}
data
```

```{r}
for (g_name in colnames(data[13:ncol(data)])) {
  print(paste(" Number of expressed barcodes for ", g_name, ": ",sum(data[,g_name] > 0), "/", length(data[,g_name])))
  print(ggplot(data, aes(x = data[,'scaled_pxl_row_in_lowres'], y=600 - data[,'scaled_pxl_col_in_lowres'])) + geom_point(aes(color=data[,g_name])) + theme(aspect.ratio = 1) + scale_color_gradient(low="pink", high="purple") + ggtitle(paste("Spot Plot ",g_name)))
}
```

```{r}
logcounts_151673[,i] > 0
sum(logcounts_151673_filtered[,gene_id] > 0)
sum(!is.na(logcounts_151673_filtered[,gene_id]))

compute_corr_matrix(data.frame(logcounts_151673_filtered), data[,6:12])
compute_corr_matrix(data.frame(logcounts_151673_filtered_no_outliers[,gene_id]), temp_data[,6:12])
```

```{r}
g_name = 'SNAP25'
gene_id = rowData(spe)[rowData(spe)$gene_name == 'SNAP25',]$gene_id
print(sum(logcounts_151507[,gene_id] != 0))
print(sum(logcounts_151507_filtered[,gene_id] != 0))
print(sum(!is.na(logcounts_151507_filtered_no_outliers[,gene_id])))

mat_with_outliers <- compute_corr_matrix(data.frame(logcounts_151507_filtered[,gene_id]), data[,4:5])
mat_with_outliers

mat_outliers <- compute_corr_matrix(data.frame(logcounts_151507_filtered_no_outliers[,gene_id]), data[,4:5])
mat_outliers
```

```{r, message=FALSE}
# Call remove outlier function
logcounts_temp <- as.matrix(logcounts_151673_filtered)
col_names <- colnames(logcounts_temp)
logcounts_151673_filtered_no_outliers <- remove_outliers(logcounts_temp, 100)
colnames(logcounts_151673_filtered_no_outliers) <- col_names

# Remove columns that are completely NA
not_all_na <- function(x) any(!is.na(x))
logcounts_151673_filtered_no_outliers <- data.frame(logcounts_151673_filtered_no_outliers) %>% select(where(not_all_na))

# compute correlation matrix
mat <- compute_corr_matrix(logcounts_151673_filtered_no_outliers, data[6:12])
```

```{r, message=FALSE}
top_corr <- get_top_k_corr(mat, 10)
plot_corr_matrix(top_corr)
```

```{r}
# MOBP, SNAP25, plot 
# Linear regression model
# Where do MOBP/SNAP25 fall in my analysis
# When did it get removed, spot plots of these genes
# spot plots of top genes in correlation analysis
gene_id = rowData(spe)[rowData(spe)$gene_name == 'LFNG',]$gene_id
sum(!is.na(logcounts_151507_filtered_no_outliers[,gene_id]))
```

```{r, message=FALSE}
plot_all_channels('LFNG', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
```

```{r, message=FALSE}
plot_all_channels('ELK4', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
```

```{r, message=FALSE}
plot_all_channels('PLIN4', logcounts_151507_filtered_no_outliers, color_arr, remove_outliers=FALSE)
```