cancer-biomarker-discover / Git / [6df130] /notebooks/scRNAseq

Models:
DanielG/
cancer-biomarker-discover
Downloads: 1
[6df130]: / notebooks / scRNAseq_analysis.Rmd
History
Download this file
749 lines (624 with data), 27.5 kB

---
title: "Analysis of scRNA-seq data"
date: "Last modified: `r format(Sys.time(), '%B %d, %Y')`"
tags: [scRNA-seq, seurat, melanoma, immunotherapy, PBMCs] 
output:
  html_document:
    theme: flatly
    highlight: zenburn
    toc: true
    number_sections: false
    toc_depth: 2
    toc_float:
      collapsed: false
      smooth_scroll: true
    code_folding: hide
    self_contained: yes
---

# Project Description

This project aims to identify therapeutic targets for melanoma patients by following a 2-phase approach:

-   Phase I: Analyze scRNA-seq data to identify DE genes and infer enriched pathways\
-   Phase II: Compare responders vs. non-responders to identify therapeutic targets

## Dataset Description

This analysis integrates five complementary single-cell RNA sequencing datasets:

1. Dataset 1 (Treatment Response Dataset)
   - Contains treatment response data
   - Includes responder/non-responder classifications
   - Focus on immunotherapy outcomes

2. Dataset 2 (Tumor Cell Dataset)
   - Characterizes tumor cell heterogeneity
   - Includes cell type annotations
   - Focus on tumor cell populations

3. Dataset 3 (Tumor Microenvironment Dataset)
   - Maps the tumor microenvironment
   - Contains malignant/non-malignant classifications
   - Includes detailed immune cell typing

4. Dataset 4 (Control Melanocyte Dataset)
   - Normal melanocyte control data
   - Reference for non-malignant state
   - Baseline expression profiles

5. Dataset 5 (Control Immune Dataset)
   - Normal immune cell control data
   - Reference for immune cell states
   - Baseline immune profiles

## Strategy

Use `seurat` and `harmony` to complete the following tasks:

1.  Load/process/combine data\
2.  Clustering and biomarker determination\
3.  Comparisons: disease vs. control

# Initialize environment

Install required packages.

```{r install_packages, results="hide", message=F, warning=F, error=F}

# Define packages to install
pkg.list = c('svDialogs', 'vroom', 'dplyr', 'DT', 'Seurat', 'harmony', 
             'patchwork', 'ggplot2', 'ggrepel', 'openxlsx', 'progress')

# Define packages not already installed
pkg.install <- pkg.list[!(pkg.list %in% installed.packages()[, 'Package'])]

# Install uninstalled packages
if (length(pkg.install) > 0) {
  install.packages(pkg.install)
}

```

Load installed packages.

```{r load_packages, results="hide", message=F, warning=F, error=F}

# Load packages
library(svDialogs)    # for prompting user-input
library(vroom)        # for quickly reading data
library(dplyr)        # for data processing
library(DT)           # to display datatables
library(Seurat)       # for scRNA-seq analysis
library(harmony)      # to integration scRNA-seq data
library(patchwork)    # for combining plots
library(ggplot2)      # for data visualization
library(ggrepel)      # to use geom_pont_repel()
library(openxlsx)     # to write data to excel
library(progress)     # to display progress bar

```

Define custom functions.

```{r custom_functions, results="hide", message=F, warning=F, error=F}

# qc_filter(): filter data based on QC for scRNA-seq data
qc_filter <- function(obj, feat.t=c(200, 2500), pct.mt.t=5, var.method='vst', 
                      feat.n=2000, qc.plot=T, top.n=10, title='') {
  
  ############################ FUNCTION DESCRIPTION ############################
  # feat.t = lower and upper limits on unique gene counts
  # pct.mt.t = threshold of level in mitochondrial contamination
  # var.method = method for selecting highly variable genes
  # feat.n = number of variable genes to select
  # qc.plot = boolean whether to generate plots to decide downstream analyses
  # title = string to use for plot title
  ############################## BEGIN FUNCTION ################################
  
  # determine percentage of mitochondrial contamination
  obj[['pct.mt']] <- PercentageFeatureSet(obj, pattern='^MT-')
  # filter + nomalize + scale data
  obj <- obj %>% 
    subset(., subset=(nFeature_RNA > feat.t[1]) & (nFeature_RNA < feat.t[2]) & 
             (pct.mt < pct.mt.t)) %>% NormalizeData(.) %>% 
    FindVariableFeatures(., selection.method=var.method) %>% 
    ScaleData(.) %>% RunPCA(., features=VariableFeatures(object=.))
  # generate follow-up QC plots (if prompted)
  if (qc.plot) { 
    p1 <- VariableFeaturePlot(obj) %>% 
      LabelPoints(plot=., points=head(VariableFeatures(obj), top.n), repel=T)
    p2 <- ElbowPlot(obj)
    plot(p1 + p2 + plot_annotation(title=title))
  }
  # return output
  return(obj)
}

# de_analyze(): conduct differential expression (DE) analysis
de_analyze <- function(m1, m2, alt='two.sided', paired=F, var.equal=F, 
                       adj.method='bonferroni', t=0.05, de.plot=F, title='') { 
  
  ############################ FUNCTION DESCRIPTION ############################
  # m1, m2 = expression matrices to compare
  # alt, paired, var.equal = arguments for t.test() function
  # adj.method = method for calculating adjusted p-value
  # t = threshold for significance 
  # de.plot = boolean whether to generate a volcano plot
  ############################## BEGIN FUNCTION ################################
  
  # make sure two matrices have same number of rows
  if (nrow(m1) != nrow(m2)) { 
    stop('Row length does not match between the provided matrices.')
  }
  # make sure gene names align between matrices
  if (!all(rownames(m1) == rownames(m2))) { 
    stop('Gene names do not align between provided matrices.')
    }
  # instantiate output variable
  results <- data.frame(gene=rownames(m1), 
                        t.stat=vector(mode='numeric', length=nrow(m1)), 
                        p.val=vector(mode='numeric', length=nrow(m1)))
  # conduct unpaired t-test with unequal variance for each gene
  pb <- progress_bar$new(
    format='  analyzing [:bar] :percent time left: :eta', total=nrow(m1))
  for (i in 1:nrow(m1)) { 
    pb$tick()
    x <- m1[i, ]; y <- m2[i, ]
    r <- t.test(x, y, alternative=alt, paired=paired, var.equal=var.equal)
    results$t.stat[i] <- r$statistic
    results$p.val[i] <- r$p.value
  }
  # determine adjusted p-values
  results$q.val <- p.adjust(results$p.val, method=adj.method)
  # add additional fields
  results <- results %>%
    mutate(Significance=case_when(q.val < t & t.stat > 0 ~ 'Up',
                                  q.val < t & t.stat < 0 ~ 'Down',
                                  T ~ 'NS')) %>% arrange(q.val)
  # generate volcano plot (if prompted)
  if (de.plot) { 
    p <- results %>% arrange(t.stat) %>% ggplot(data=., 
           aes(x=t.stat, y=-log10(q.val), col=Significance, label=gene)) +
      geom_point() + geom_text_repel() + theme_minimal() + 
      scale_color_manual(values=c('blue', 'black', 'red')) + ggtitle(title)
    plot(p)
  }
  # return output
  return(results)
}

```

Additional settings.

```{r settings}

# Adjust system settings
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 2)

# Save plots? (default: F)
save.plots <- dlgInput('Save all outputs? (T/F)', F)$res

```

# Tasks

## 1. Load/process/combine data

**Description**: load all scRNA-seq files.

```{r load_data, warning=F, message=F}
setwd("..")
# Load workspace (if it exists)
f <- list.files()
if (any(endsWith(f, '.RData'))) {
  load(f[endsWith(f, '.RData')][1])
}
```

```{r load_data, warning=F, message=F}
##################### ONLY RUN IF THERE'S NO WORKSPACE #########################
# # Load disease datasets
# obj <- list(); meta <- list()
# #   Dataset 1: Treatment response dataset
# meta$dataset1 <- read.delim(file='data/dataset1/metadata.txt')
# x <- vroom('data/dataset1/counts.txt', col_names=F)
# x <- x[, 1:nrow(meta$dataset1)] %>% as.matrix(.)
# y <- read.delim('data/dataset1/genes.txt', header=F)
# rownames(x) <- make.unique(y$V1)
# colnames(x) <- meta$dataset1$ID
# obj$dataset1 <- CreateSeuratObject(x, project='dataset1',
#                                     min.cells=3, min.features=200)
# obj$dataset1$Treatment <- as.factor(meta$dataset1$Treatment)
# obj$dataset1$Response <- as.factor(meta$dataset1$Response)
# rm(x, y); gc()
# 
# #   Dataset 2: Tumor cell dataset
# meta$dataset2 <- read.csv('data/dataset2/metadata.csv')
# x <- vroom('data/dataset2/expression_matrix.csv')
# y <- as.matrix(x[, -1]); rownames(y) <- x$...1
# obj$dataset2 <- CreateSeuratObject(y, project='dataset2',
#                                     min.cells=3, min.features=200)
# obj$dataset2$type <- as.factor(meta$dataset2$cell.types)
# 
# #   Dataset 3: Tumor microenvironment dataset
# x <- vroom('data/dataset3/expression_matrix.txt')
# y <- as.matrix(x[-c(1:3), -1]); rownames(y) <- x$Cell[-c(1:3)]
# obj$dataset3 <- CreateSeuratObject(y, project='dataset3',
#                                    min.cells=3, min.features=200)
# meta$dataset3 <- data.frame(Tumor=t(x[1, -1]),
#                             Malignant=case_when(x[2, -1] == 1 ~ 'No',
#                                                 x[2, -1] == 2 ~ 'Yes',
#                                                 x[2, -1] == 0 ~ 'Unresolved'),
#                             NMType=case_when(x[3, -1] == 1 ~ 'T',
#                                              x[3, -1] == 2 ~ 'B',
#                                              x[3, -1] == 3 ~ 'Macro',
#                                              x[3, -1] == 4 ~ 'Endo',
#                                              x[3, -1] == 5 ~ 'CAF',
#                                              x[3, -1] == 6 ~ 'NK',
#                                              x[3, -1] == 0 ~ 'NA'))
# obj$dataset3$Malignant <- meta$dataset3$Malignant
# obj$dataset3$NMType <- meta$dataset3$NMType
# 
# # Load control datasets
# #   Dataset 4: Control melanocyte dataset
# x <- vroom('data/dataset4/expression_matrix.csv')
# y <- as.matrix(x[, -1]); rownames(y) <- x$...1
# obj$dataset4 <- CreateSeuratObject(y, project='dataset4',
#                                     min.cells=3, min.features=200)
# 
# #   Dataset 5: Control immune dataset
# meta$dataset5 <- read.table('data/dataset5/metadata.tsv',
#                              sep='\t', header=T)
# x <- vroom('data/dataset5/matrix.tsv')
# y <- as.matrix(x[, -1]); rownames(y) <- x$Gene.Name
# obj$dataset5 <- CreateSeuratObject(y, project='dataset5',
#                                     min.cells=3, min.features=200)
# obj$dataset5$sample <- as.factor(meta$dataset5$sample)
# rm(f, x, y, meta); gc()

```

*Description*: Combine all datasets together into single Seurat object. Also apply `harmony` to remove clustering bias based on dataset source.

```{r combine_data, warning=F, message=F}
file_path <- "/Users/scampit/Projects/torchstack/cancer-biomarker-discovery"
setwd(file_path)

# Combine all datasets
if (!exists('data.all')) { 
  data.all <- list()
}
if (!('raw' %in% names(data.all))) { 
  data.all$raw <- merge(obj$dataset1, obj$dataset2) %>%
    merge(., obj$dataset3) %>% merge(., obj$dataset4) %>% merge(., obj$dataset5)
}

# Add source information
if (!('Source' %in% names(data.all$raw@meta.data))) { 
  data.all$raw$Source <- c(rep(names(obj)[1], ncol(obj$dataset1)),
                           rep(names(obj)[2], ncol(obj$dataset2)),
                           rep(names(obj)[3], ncol(obj$dataset3)),
                           rep(names(obj)[4], ncol(obj$dataset4)),
                           rep(names(obj)[5], ncol(obj$dataset5)))
}

# Apply QC
if (!('proc' %in% names(data.all))) { 
  data.all$proc <- qc_filter(data.all$raw)
}

# Visualize integration results
p1 <- DimPlot(object=data.all$proc, reduction='pca', pt.size=0.1, 
              group.by='Source')
p2 <- VlnPlot(object=data.all$proc, features='PC_1', pt.size=0.1, 
              group.by='Source')
p1 + p2
if (save.plots) { 
  ggsave('plots/QC_no_harmony.pdf', width=10, height=5, units='in')
}

# Apply harmony (to remove clustering based on dataset source)
if (!('harmony' %in% names(data.all$proc@reductions))) { 
  data.all$proc <- data.all$proc %>% RunHarmony('Source', plot_convergence=T)
}

# Visualize integration results (after harmony)
p1 <- DimPlot(object=data.all$proc, reduction='harmony', pt.size=0.1, 
              group.by='Source')
p2 <- VlnPlot(object=data.all$proc, features='harmony_1', pt.size=0.1, 
              group.by='Source')
p1 + p2
if (save.plots) { 
  ggsave('plots/QC_w_harmony.pdf', width=10, height=5, units='in')
}

```

## 2. Clustering and biomarker determination

**Description**: cluster cells based on UMAP and determine biomarkers based on cluster assignment.

### Clustering

```{r clustering, warning=F, message=T}

# Visualize UMAP plots (runtime: ~5 minutes)
if (!('umap' %in% names(data.all$proc@reductions))) { 
  data.all$proc <- data.all$proc %>% 
      RunUMAP(reduction='harmony', dims=1:20) %>% 
      FindNeighbors(reduction='harmony', dims=1:20) %>% 
      FindClusters(resolution=0.5) %>% identity()
}

#   by dataset source
p <- DimPlot(data.all$proc, reduction='umap', group.by='Source', pt.size=0.1, 
        split.by='Source') + ggtitle('UMAP split by dataset source'); plot(p)
#   by cluster (unlabeled)
p <- DimPlot(data.all$proc, reduction='umap', label=T, pt.size=0.1) + 
  ggtitle('UMAP of combined scRNA-seq data (unlabeled)'); plot(p)
if (save.plots) { 
  ggsave('plots/UMAP_unlabeled.pdf', width=10, height=5, units='in')
}

#   by cluster (labeled)
lab <- c('Tumor cell', 'Technical error 1', 'Cytotoxic CD8 T cell 1', 'B cell', 
         'Melanocytes 1', 'Macrophage', 'Cytotoxic CD8 T cell 2', 'Technical error 2', 
         'Memory T cell', 'Melanocytes 2', 'Dysfunctional CD8 T cell', 
         'Cytotoxic CD8 T cell 3', '? 1', 'Melanocytes 3', 'Activated cell', 
         'Exhausted T cell', 'Cytotoxic T cell', '? 2', '? 3', 'Melanocytes 4', '? 4')
names(lab) <- levels(data.all$proc)
plot.data <- data.all$proc %>% RenameIdents(., lab)
p <- DimPlot(plot.data, reduction='umap', label=T, pt.size=0.1) + 
  ggtitle('UMAP of combined scRNA-seq data (labeled)'); plot(p)
if (save.plots) { 
  ggsave('plots/UMAP_labeled.pdf', width=10, height=5, units='in')
}

```

### Biomarkers

```{r biomarkers, warning=F, message=T}

# Find all biomarkers based on clustering (runtime: ~30 minutes)
if (!('bm' %in% names(data.all))) { 
  data.all$bm <- FindAllMarkers(data.all$proc, min.pct=0.25, logfc.threshold=0.25)
}

# View table of top 3 biomarkers for each cluster
d <- data.all$bm %>% group_by(cluster) %>% slice_max(n=3, order_by=avg_log2FC)
datatable(d)

# Visualize ridge plots based on biomarkers of interest
ridge.feat <- dlg_list(sort(unique(d$gene)), multiple=T)$res
p <- RidgePlot(data.all$proc, features=ridge.feat, 
          ncol=ceiling(length(ridge.feat) / 2)); plot(p)
if (save.plots) { 
  ggsave('plots/biomarker_ridge_plots.pdf', width=10, height=10, units='in')
}

```

## 3. Comparisons: disease vs. control

**Description**: determine differentially expressed genes (DEGs) between disease vs. control groups.

### Case: tumor

```{r DE_tumor, warning=F, message=T}

# Instantiate DE variable
if (!exists('de')) { 
  de <- list()
}
if (!('data' %in% names(de))) { 
  de$data <- GetAssayData(object=data.all$proc, slot='data')
}

# Define datasets to compare
ix <- intersect(rownames(obj$dataset3), rownames(obj$dataset2)) %>%
  intersect(., rownames(obj$dataset4))
jx1 <- data.all$proc$Source == names(obj)[3] | data.all$proc$Malignant %in% 'Yes'
jx2 <- data.all$proc$Source == names(obj)[4]
x <- de$data[ix, jx1]; y <- de$data[ix, jx2]

# DE analysis (runtime: ~15 minutes)
if (!('tumor' %in% names(de))) { 
  de$tumor <- de_analyze(x, y) %>% na.omit
}

# Visualize dot plot
dot.feat <- dlg_list(de$tumor$gene[de$tumor$q.val < 0.05], multiple=T)$res
p.data <- data.all$proc 
Idents(p.data) <- as.factor(case_when(jx1 ~ 'Disease', jx2 ~ 'Control', T ~ 'NA'))
p <- p.data %>% 
  DotPlot(object=., features=dot.feat, idents=c('Disease', 'Control')) + 
  ggtitle('Case: tumor')
plot(p)
if (save.plots) { 
  ggsave('plots/DE_tumor.pdf', width=10, height=5, units='in')
}

```

### Case: immune cells (bulk)

```{r DE_immune_bulk,warning=F, message=T}

# Define datasets to compare
ix <- intersect(rownames(obj$dataset1), rownames(obj$dataset2)) %>% 
  intersect(., rownames(obj$dataset5))
jx1 <- data.all$proc$Source == names(obj)[1] | data.all$proc$Malignant %in% 'No'
jx2 <- data.all$proc$Source == names(obj)[5]
x <- de$data[ix, jx1]; y <- de$data[ix, jx2]

# DE analysis (runtime: ~20 minutes)
if (!('immune' %in% names(de))) { 
  de$immune <- de_analyze(x, y) %>% na.omit()
}

# Visualize dot plot
dot.feat <- dlg_list(de$immune$gene[de$immune$q.val < 0.05], multiple=T)$res
p.data <- data.all$proc 
Idents(p.data) <- as.factor(case_when(jx1 ~ 'Disease', jx2 ~ 'Control', T ~ 'NA'))
p <- p.data %>% 
  DotPlot(object=., features=dot.feat, idents=c('Disease', 'Control')) + 
  ggtitle('Case: immune cells')
plot(p)
if (save.plots) { 
  ggsave('plots/DE_immune_bulk.pdf', width=10, height=5, units='in')
}

```

### Case: immune cells (cluster-based)

```{r DE_immune_cluster, warning=F, message=T}

# Define datasets to compare (runtime: ~5 minutes)
immune.cells <- c('B', 'CD8', 'Macrophage', 'Memory', 'Activated')
for (i in 1:length(immune.cells)) { 
  # Define row + column indices
  ix <- intersect(rownames(obj$dataset1), rownames(obj$dataset2)) %>% 
    intersect(., rownames(obj$dataset5))
  c <- grepl(immune.cells[i], Idents(plot.data))
  jx1 <- (plot.data$Source == names(obj)[1] | 
            data.all$proc$Malignant %in% 'No') & c
  jx2 <- data.all$proc$Source == names(obj)[5] & c
  x <- de$data[ix, jx1]; y <- de$data[ix, jx2]
  # DE analysis
  if (!(immune.cells[i] %in% names(de))) { 
    de[[immune.cells[i]]] <- de_analyze(x, y) %>% na.omit()
  }
  # Visualize dot plot
  x <- de[[immune.cells[[i]]]]
  dot.feat <- x$gene[x$q.val < 0.05][1:5]
  p.data <- data.all$proc 
  Idents(p.data) <- as.factor(case_when(jx1 ~ 'Disease', jx2 ~ 'Control', T ~ 'NA'))
  p <- p.data %>% 
    DotPlot(object=., features=dot.feat, idents=c('Disease', 'Control')) + 
    ggtitle(paste0('Case: ', immune.cells[i]))
  plot(p)
  if (save.plots) { 
    ggsave(paste0('plots/DE_', immune.cells[i], '.pdf'), 
           width=10, height=5, units='in')
  }
}

```

# Case: responder vs. non-responder
ix <- rownames(obj$dataset1)
jx1 <- data.all$proc$Response %in% 'Responder'
jx2 <- data.all$proc$Response %in% 'Non-responder'

# Save outputs and working space (if prompted)

```

```{r glycolysis_visualization, warning=F, message=T}
# Define key glycolytic genes
glycolysis_genes <- c(
  "HK1", "HK2", "GPI", "PFKL", "PFKM", "PFKP",  # Glucose uptake and early glycolysis
  "ALDOA", "ALDOB", "ALDOC",                     # Aldolase isoforms
  "TPI1",                                        # Triose phosphate isomerase
  "GAPDH",                                       # Glyceraldehyde phosphate dehydrogenase
  "PGK1",                                        # Phosphoglycerate kinase
  "PGAM1",                                       # Phosphoglycerate mutase
  "ENO1", "ENO2",                               # Enolase isoforms
  "PKM", "PKLR",                                # Pyruvate kinase isoforms
  "LDHA", "LDHB",                               # Lactate dehydrogenase
  "SLC2A1", "SLC2A3"                            # Glucose transporters (GLUT1, GLUT3)
)

# Filter for genes present in the dataset
glycolysis_genes_present <- intersect(glycolysis_genes, rownames(de$data))

# Create anonymous gene IDs
anon_genes <- paste0("gene_", seq_along(glycolysis_genes_present))
names(anon_genes) <- glycolysis_genes_present

# Create mapping dataframe for reference
gene_mapping <- data.frame(
  Original_Gene = glycolysis_genes_present,
  Anonymous_ID = anon_genes[glycolysis_genes_present]
)

# Write mapping to file if saving is enabled
if (save.plots) {
  write.csv(gene_mapping, 'results/glycolysis_gene_mapping.csv', row.names = FALSE)
}

# Create dot plot for glycolytic genes
p.data <- data.all$proc
Idents(p.data) <- as.factor(case_when(jx1 ~ 'Disease', jx2 ~ 'Control', T ~ 'NA'))

# Generate dot plot with anonymous IDs and custom colors
p_glycolysis <- DotPlot(p.data, 
                        features = glycolysis_genes_present,
                        idents = c('Disease', 'Control')) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab('Genes') +
  ylab('Condition') +
  scale_color_gradient2(low = "navy", mid = "gray90", high = "red", 
                       midpoint = 0, name = "Average\nExpression") +
  scale_x_discrete(labels = anon_genes[glycolysis_genes_present])

# Display plot
print(p_glycolysis)

# Save plot if requested
if (save.plots) {
  ggsave('plots/DE_tumor_glycolysis.pdf', 
         plot = p_glycolysis, 
         width = 12, 
         height = 6, 
         units = 'in')
}

# Calculate statistics for glycolytic genes
glycolysis_stats <- de$tumor[de$tumor$gene %in% glycolysis_genes_present, ]
glycolysis_stats <- glycolysis_stats[order(glycolysis_stats$q.val), ]

# Add anonymous IDs to statistics
glycolysis_stats$anonymous_id <- anon_genes[glycolysis_stats$gene]

# Display statistics with anonymous IDs
print("Differential Expression Statistics for De-identified Glycolytic Genes:")
glycolysis_stats$gene <- glycolysis_stats$anonymous_id
print(glycolysis_stats)

# Count significantly different genes
sig_glyco_genes <- sum(glycolysis_stats$q.val < 0.05)
print(sprintf("Number of significantly different glycolytic genes (q < 0.05): %d", sig_glyco_genes))
```

```{r glycolysis_boxplots, warning=F, message=T}
# Get expression data for glycolytic genes
expr_data <- GetAssayData(p.data, slot = "data")[glycolysis_genes_present,]

# Filter out genes with all zeros
nonzero_genes <- rownames(expr_data)[rowSums(expr_data > 0) > 0]
expr_data <- expr_data[nonzero_genes,]

# Update anonymous gene IDs for non-zero genes
anon_genes <- anon_genes[nonzero_genes]

target_genes <- c("g10", "g21", "g03")
selected_genes <- names(anon_genes)[anon_genes %in% target_genes]

# Create a data frame for plotting and store statistical results
plot_data <- data.frame()
stat_results <- data.frame(
  Gene = character(),
  P_value = numeric(),
  T_statistic = numeric(),
  Mean_diff = numeric(),
  stringsAsFactors = FALSE
)

for (gene in selected_genes) {
  # Get expression values for current gene
  expr_values <- expr_data[gene,]
  
  # Split values by condition before centering
  disease_vals <- expr_values[colnames(expr_data) %in% colnames(de$data)[jx1]]
  control_vals <- expr_values[colnames(expr_data) %in% colnames(de$data)[jx2]]
  
  # Perform t-test on raw values
  t_test_result <- t.test(disease_vals, control_vals)
  
  # Center the expression values by subtracting the overall mean
  overall_mean <- mean(expr_values)
  disease_vals_centered <- disease_vals - overall_mean
  control_vals_centered <- control_vals - overall_mean
  
  # Store statistical results
  stat_results <- rbind(stat_results, data.frame(
    Gene = anon_genes[gene],
    P_value = t_test_result$p.value,
    T_statistic = t_test_result$statistic,
    Mean_diff = mean(disease_vals) - mean(control_vals),
    Mean_centered_diff = mean(disease_vals_centered) - mean(control_vals_centered),
    stringsAsFactors = FALSE
  ))
  
  # Create temporary data frames for each condition with centered values
  temp_df <- data.frame(
    Expression = c(disease_vals_centered, control_vals_centered),
    Gene = anon_genes[gene],
    Condition = c(rep("Disease", length(disease_vals_centered)), 
                 rep("Control", length(control_vals_centered)))
  )
  
  # Append to main data frame
  plot_data <- rbind(plot_data, temp_df)
}

# Add FDR correction
stat_results$FDR <- p.adjust(stat_results$P_value, method = "BH")

# Sort results by p-value
stat_results <- stat_results[order(stat_results$P_value), ]

# Create boxplot with swarm plot
p_boxplot <- ggplot(plot_data, aes(x = Gene, y = Expression, fill = Condition)) +
  # Add swarm plot layer first (so it appears behind)
  geom_jitter(aes(color = Condition), position = position_jitterdodge(jitter.width = 0.2), 
              size = 0.3, alpha = 0.3) +
  # Add boxplot layer second (so it appears in front)
  geom_boxplot(outlier.shape = NA, alpha = 0.4, width = 0.7) +
  # Colors for boxplot fill and point colors
  scale_fill_manual(values = c("Disease" = "#E41A1C", "Control" = "#377EB8")) +
  scale_color_manual(values = c("Disease" = "#E41A1C", "Control" = "#377EB8")) +
  # Set y-axis limits
  ylim(-2, max(plot_data$Expression)) +
  # Theme customization
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.y = element_blank(),  # Remove y-axis major grid lines
    panel.grid.minor.y = element_blank(),  # Remove y-axis minor grid lines
    panel.grid.major.x = element_line(color = "gray90"),  # Keep x-axis major grid lines
    panel.grid.minor.x = element_blank(),  # Remove x-axis minor grid lines
    legend.position = c(0.95, 0.95),  # Position legend inside top-right
    legend.justification = c(1, 1),    # Anchor point for legend
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    legend.background = element_rect(fill = "white", color = NA),
    legend.box.background = element_rect(color = "gray90")
  ) +
  # Labels
  xlab("Genes") +
  ylab("Centered Expression") +
  ggtitle("Gene Expression Relative to Mean") +
  # Add horizontal line at y=0
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50", alpha = 0.5)

# Display plot
print(p_boxplot)

# Save plot if requested
if (save.plots) {
  ggsave('plots/DE_tumor_glycolysis_boxplots.pdf', 
         plot = p_boxplot, 
         width = 12, 
         height = 6, 
         units = 'in')
}

# Calculate summary statistics for centered data
summary_stats <- plot_data %>%
  group_by(Gene, Condition) %>%
  summarise(
    Mean = mean(Expression),
    Median = median(Expression),
    SD = sd(Expression),
    Q1 = quantile(Expression, 0.25),
    Q3 = quantile(Expression, 0.75),
    n = n(),
    .groups = 'drop'
  )

# Display summary statistics
print("Summary Statistics for Centered Expression by Condition:")
print(summary_stats)

# Display statistical test results
print("\nStatistical Test Results (Disease vs Control):")
print(stat_results)

# Print number of genes removed due to zero expression
n_removed <- length(glycolysis_genes_present) - length(nonzero_genes)
if (n_removed > 0) {
  print(sprintf("\nRemoved %d genes with zero expression", n_removed))
  print("Removed genes:")
  print(setdiff(glycolysis_genes_present, nonzero_genes))
}

# Print number of significantly different genes
sig_genes <- sum(stat_results$FDR < 0.05)
if (sig_genes > 0) {
  print(sprintf("\nNumber of significantly different genes (FDR < 0.05): %d", sig_genes))
  print("Significant genes:")
  print(stat_results[stat_results$FDR < 0.05, ])
}
```