|
a |
|
b/Preprocessing.Rmd |
|
|
1 |
--- |
|
|
2 |
title: "Preprocessing" |
|
|
3 |
output: |
|
|
4 |
html_document: |
|
|
5 |
df_print: paged |
|
|
6 |
editor_options: |
|
|
7 |
chunk_output_type: console |
|
|
8 |
--- |
|
|
9 |
|
|
|
10 |
# Preprocessing Data (Following OSTA Analysis Pipeline) |
|
|
11 |
|
|
|
12 |
In this markdown, analysis and quality control steps are performed following chapters 8 and 9 of the book 'Orchestrating Spatially-Resolved Transcriptomics Analysis with Bioconductor' <https://lmweber.org/OSTA-book/>. Data is fetched and plotted, then QC metrics are calculated and data is filtered based on QC thresholds for a single sample. |
|
|
13 |
|
|
|
14 |
### Overview of Data |
|
|
15 |
|
|
|
16 |
Data is sourced from spatialLIBD (<http://spatial.libd.org/spatialLIBD/>), a project containing LIBD human dorsolateral pre-frontal cortex (DLPFC) spatial transcriptomics data from the 10x Genomics Visium Platform (<https://10xgenomics.com/products/spatial-gene-expression>). The data contains 12 lowres samples each with a scale factor of 0.0450045. |
|
|
17 |
|
|
|
18 |
### Importing Libraries |
|
|
19 |
|
|
|
20 |
```{r Import Libraries, message=FALSE} |
|
|
21 |
library(spatialLIBD) |
|
|
22 |
library(ggspavis) |
|
|
23 |
library(scater) |
|
|
24 |
library(ggplot2) |
|
|
25 |
``` |
|
|
26 |
|
|
|
27 |
### Helper Functions for QC |
|
|
28 |
|
|
|
29 |
The following functions are called by the driver code to perform plotting and QC operations. |
|
|
30 |
|
|
|
31 |
```{r Helper Functions for QC, message=FALSE} |
|
|
32 |
#' Plots a SpatialExperiment object by transforming the image into a rasterGrob object and calling ggplot |
|
|
33 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
34 |
#' @return None |
|
|
35 |
plot_image <- function(spe_sub) { |
|
|
36 |
# get histology image |
|
|
37 |
img <- SpatialExperiment::imgRaster(spe_sub) |
|
|
38 |
|
|
|
39 |
## Transform to a rasterGrob object |
|
|
40 |
grob <- grid::rasterGrob(img, width = grid::unit(1, "npc"), height = grid::unit(1, "npc")) |
|
|
41 |
|
|
|
42 |
# Create df for plotting |
|
|
43 |
sample_df <- as.data.frame(colData(spe_sub), optional = TRUE) |
|
|
44 |
|
|
|
45 |
## Make a plot using geom_spatial |
|
|
46 |
p <- ggplot2::ggplot( |
|
|
47 |
sample_df, |
|
|
48 |
ggplot2::aes( |
|
|
49 |
x = pxl_col_in_fullres * SpatialExperiment::scaleFactors(spe_sub), |
|
|
50 |
y = pxl_row_in_fullres * SpatialExperiment::scaleFactors(spe_sub), |
|
|
51 |
) |
|
|
52 |
) + |
|
|
53 |
geom_spatial( |
|
|
54 |
data = tibble::tibble(grob = list(grob)), |
|
|
55 |
ggplot2::aes(grob = grob), |
|
|
56 |
x = 0.5, |
|
|
57 |
y = 0.5 |
|
|
58 |
) |
|
|
59 |
|
|
|
60 |
## Show the plot |
|
|
61 |
print(p) |
|
|
62 |
p |
|
|
63 |
} |
|
|
64 |
|
|
|
65 |
#' Identifies mitochondrial genes, and adds per spot QC metrics including mitochondrial genes as a feature subset |
|
|
66 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
67 |
#' @return spe_sub Spatial Experiment object with added QC metrics |
|
|
68 |
add_QC_metrics <- function(spe_sub) { |
|
|
69 |
# identify mitochondrial genes |
|
|
70 |
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe_sub)$gene_name) |
|
|
71 |
|
|
|
72 |
# add per spot QC metrics, including mitochondrial genes as a feature subset |
|
|
73 |
spe_sub <- addPerCellQC(spe_sub, subsets = list(mito = is_mito)) |
|
|
74 |
spe_sub |
|
|
75 |
} |
|
|
76 |
|
|
|
77 |
#' Sets a lower filtering threshold for library size, plots QC plot and spatial pattern of spots |
|
|
78 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
79 |
#' @param threshold lower limit on library size |
|
|
80 |
#' @return spe_sub Spatial Experiment object with added col data after lib size thresholding |
|
|
81 |
threshold_lib_size <- function(spe_sub, threshold) { |
|
|
82 |
# continuous variable containing total number of genes at each spot |
|
|
83 |
hist(colData(spe_sub)$sum, breaks = 20) |
|
|
84 |
|
|
|
85 |
# set filtering threshold for library size |
|
|
86 |
print(plotQC(spe_sub, type = "scatter", |
|
|
87 |
metric_x = "cell_count", metric_y = "sum", |
|
|
88 |
threshold_y = threshold)) |
|
|
89 |
|
|
|
90 |
# check number of spots below threshold |
|
|
91 |
qc_lib_size <- colData(spe_sub)$sum < threshold |
|
|
92 |
print(table(qc_lib_size)) |
|
|
93 |
|
|
|
94 |
# look for spatial pattern in discarded spots |
|
|
95 |
colData(spe_sub)$qc_lib_size <- qc_lib_size |
|
|
96 |
|
|
|
97 |
# plot spots |
|
|
98 |
threshold_row <- colData(spe_sub)[colData(spe_sub)$qc_lib_size,][,'array_row'] |
|
|
99 |
threshold_col <- colData(spe_sub)[colData(spe_sub)$qc_lib_size,][,'array_col'] |
|
|
100 |
print(ggplot(data.frame(threshold_row, threshold_col), aes(x=threshold_row, y=threshold_col)) + geom_point() + ggtitle("Discarded Spots (Low Library Size)")) |
|
|
101 |
|
|
|
102 |
spe_sub |
|
|
103 |
} |
|
|
104 |
|
|
|
105 |
#' Sets a lower filtering threshold for number of expressed genes, plots QC plot and spatial pattern of spots |
|
|
106 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
107 |
#' @param threshold lower limit on number of expressed genes |
|
|
108 |
#' @return spe_sub Spatial Experiment object with added col data after expressed gene thresholding |
|
|
109 |
threshold_expressed_genes <- function(spe_sub, threshold) { |
|
|
110 |
# continuous variable containing total number of genes at each spot |
|
|
111 |
hist(colData(spe_sub)$detected, breaks = 20) |
|
|
112 |
|
|
|
113 |
# set filtering threshold for library size at 500 |
|
|
114 |
print(plotQC(spe_sub, type = "scatter", |
|
|
115 |
metric_x = "cell_count", metric_y = "detected", |
|
|
116 |
threshold_y = threshold)) |
|
|
117 |
|
|
|
118 |
# check number of spots below threshold |
|
|
119 |
qc_detected <- colData(spe_sub)$detected < threshold |
|
|
120 |
print(table(qc_lib_size)) |
|
|
121 |
|
|
|
122 |
# look for spatial pattern in discarded spots |
|
|
123 |
colData(spe_sub)$qc_detected <- qc_detected |
|
|
124 |
|
|
|
125 |
# plot spots |
|
|
126 |
threshold_row <- colData(spe_sub)[colData(spe_sub)$qc_detected,][,'array_row'] |
|
|
127 |
threshold_col <- colData(spe_sub)[colData(spe_sub)$qc_detected,][,'array_col'] |
|
|
128 |
print(ggplot(data.frame(threshold_row, threshold_col), aes(x=threshold_row, y=threshold_col)) + geom_point() + ggtitle("Discarded Spots (Low Expressed Genes)")) |
|
|
129 |
|
|
|
130 |
spe_sub |
|
|
131 |
} |
|
|
132 |
|
|
|
133 |
#' Sets an upper threshold on the proportion of mitochondrial reads |
|
|
134 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
135 |
#' @param threshold upper limit on proportional of mitochondrial reads |
|
|
136 |
#' @return spe_sub Spatial Experiment object with added col data after mitochondrial read thresholding |
|
|
137 |
threshold_mito_reads <- function(spe_sub, threshold) { |
|
|
138 |
# histogram of mitochondrial read proportions |
|
|
139 |
hist(colData(spe_sub)$subsets_mito_percent, breaks = 20) |
|
|
140 |
|
|
|
141 |
# plot mitochondrial read proportion vs. number of cells per spot |
|
|
142 |
print(plotQC(spe_sub, type = "scatter", |
|
|
143 |
metric_x = "cell_count", metric_y = "subsets_mito_percent", |
|
|
144 |
threshold_y = threshold)) |
|
|
145 |
|
|
|
146 |
# select QC threshold for mitochondrial read proportion |
|
|
147 |
qc_mito <- colData(spe_sub)$subsets_mito_percent > threshold |
|
|
148 |
print(table(qc_mito)) |
|
|
149 |
|
|
|
150 |
# look for spatial pattern in discarded spots |
|
|
151 |
colData(spe_sub)$qc_mito <- qc_mito |
|
|
152 |
|
|
|
153 |
# plot spots |
|
|
154 |
threshold_row <- colData(spe_sub)[colData(spe_sub)$qc_mito,][,'array_row'] |
|
|
155 |
threshold_col <- colData(spe_sub)[colData(spe_sub)$qc_mito,][,'array_col'] |
|
|
156 |
print(ggplot(data.frame(threshold_row, threshold_col), aes(x=threshold_row, y=threshold_col)) + geom_point() + ggtitle("Discarded Spots (High Mitochondrial Reads)")) |
|
|
157 |
|
|
|
158 |
spe_sub |
|
|
159 |
} |
|
|
160 |
|
|
|
161 |
#' Sets an upper threshold on the number of cells per spot |
|
|
162 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
163 |
#' @param threshold upper limit on number of cells per spot |
|
|
164 |
#' @return spe_sub Spatial Experiment object with added col data after cell number thresholding |
|
|
165 |
threshold_num_cells <- function(spe_sub, threshold) { |
|
|
166 |
# histogram of number of cells per spot |
|
|
167 |
hist(colData(spe_sub)$cell_count, breaks = 20) |
|
|
168 |
|
|
|
169 |
# distribution of cells per spot |
|
|
170 |
tbl_cells_per_spot <- table(colData(spe_sub)$cell_count) |
|
|
171 |
|
|
|
172 |
# plot number of expressed genes vs. number of cells per spot |
|
|
173 |
print(plotQC(spe_sub, type = "scatter", |
|
|
174 |
metric_x = "cell_count", metric_y = "detected", |
|
|
175 |
threshold_x = threshold)) |
|
|
176 |
|
|
|
177 |
# select QC threshold for number of cells per spot |
|
|
178 |
qc_cell_count <- colData(spe_sub)$cell_count > threshold |
|
|
179 |
print(table(qc_cell_count)) |
|
|
180 |
colData(spe_sub)$qc_cell_count <- qc_cell_count |
|
|
181 |
|
|
|
182 |
# plot spots |
|
|
183 |
threshold_row <- colData(spe_sub)[colData(spe_sub)$qc_cell_count,][,'array_row'] |
|
|
184 |
threshold_col <- colData(spe_sub)[colData(spe_sub)$qc_cell_count,][,'array_col'] |
|
|
185 |
print(ggplot(data.frame(threshold_row, threshold_col), aes(x=threshold_row, y=threshold_col)) + geom_point()) |
|
|
186 |
|
|
|
187 |
spe_sub |
|
|
188 |
} |
|
|
189 |
|
|
|
190 |
#' Discard spots based on previous QC metrics |
|
|
191 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
192 |
#' @return spe_sub Spatial Experiment object with low quality spots removed |
|
|
193 |
discard_spots <- function(spe_sub) { |
|
|
194 |
# get discarded spots from each QC operation |
|
|
195 |
qc_lib_size <- colData(spe_sub)$qc_lib_size |
|
|
196 |
qc_detected <- colData(spe_sub)$qc_detected |
|
|
197 |
qc_mito <- colData(spe_sub)$qc_mito |
|
|
198 |
qc_cell_count <- colData(spe_sub)$qc_cell_count |
|
|
199 |
|
|
|
200 |
# number of discarded spots for each metric |
|
|
201 |
apply(cbind(qc_lib_size, qc_detected, qc_mito, qc_cell_count), 2, sum) |
|
|
202 |
|
|
|
203 |
# combined set of discarded spots |
|
|
204 |
discard <- qc_lib_size | qc_detected | qc_mito | qc_cell_count |
|
|
205 |
print(table(discard)) |
|
|
206 |
|
|
|
207 |
# store in object |
|
|
208 |
colData(spe_sub)$discard <- discard |
|
|
209 |
|
|
|
210 |
# check spatial pattern of combined set of discarded spots |
|
|
211 |
# plot spots |
|
|
212 |
threshold_row <- colData(spe_sub)[colData(spe_sub)$discard,][,'array_row'] |
|
|
213 |
threshold_col <- colData(spe_sub)[colData(spe_sub)$discard,][,'array_col'] |
|
|
214 |
print(ggplot(data.frame(threshold_row, threshold_col), aes(x=threshold_row, y=threshold_col)) + geom_point()) |
|
|
215 |
|
|
|
216 |
spe_sub <- spe_sub[, !colData(spe_sub)$discard] |
|
|
217 |
print(dim(spe_sub)) |
|
|
218 |
spe_sub |
|
|
219 |
} |
|
|
220 |
|
|
|
221 |
#' Discard uninformative genes from the Spatial Experiment object |
|
|
222 |
#' @param spe_sub Spatial Experiment object filtered to only contain one sample |
|
|
223 |
#' @param var_threshold lower threshold for amount of variance willing to tolerate among the spots for a gene |
|
|
224 |
#' @return spe_sub Spatial Experiment object with uninformative genes removed |
|
|
225 |
discard_uninformative_genes <- function(spe_sub, var_threshold=1e-4) { |
|
|
226 |
logcounts <- t(logcounts(spe_sub)) |
|
|
227 |
mito <- (rowData(spe_sub)$gene_name %in% rowData(spe_sub)$gene_name[is_mito]) |
|
|
228 |
zero_expression <- apply(logcounts, 2, function(x) length(unique(x)) == 1) |
|
|
229 |
low_var <- apply(logcounts, 2, function(x) var(x) < 1e-4) |
|
|
230 |
discard <- zero_expression | low_var | mito |
|
|
231 |
spe_sub <- spe_sub[!discard,] |
|
|
232 |
spe_sub |
|
|
233 |
} |
|
|
234 |
``` |
|
|
235 |
|
|
|
236 |
### Driver Code |
|
|
237 |
|
|
|
238 |
In the following block, data is first fetched from SpatialLIBD and filtered to only include a single sample. Then, the sample is plotted and QC metrics are added to the SpatialExperiment object. |
|
|
239 |
|
|
|
240 |
Then, the following four QC operations are performed to get rid of low quality spots. Note that all spots are already over tissue. |
|
|
241 |
|
|
|
242 |
1. Set a lower threshold over library size. We want to remove spots with a low number of UMIs counts per spot. |
|
|
243 |
2. Set a lower threshold on the number of expressed features (i.e. genes) per spot. |
|
|
244 |
3. Set an upper threshold on the mumber of mitochondrial reads per spot. A high proportion of mitochondrial reads can indicate cell damage. |
|
|
245 |
4. Set an upper threshold on the number of cells per spot. Spots with high cell counts have a low number of expressed genes, meaning that experiments failed. |
|
|
246 |
|
|
|
247 |
After performing QC operations, low quality spots are removed from the SpatialExperiment object. |
|
|
248 |
|
|
|
249 |
```{r QC, message=FALSE} |
|
|
250 |
# Fetch data from SpatialLIBD |
|
|
251 |
spe <- fetch_data(type='spe') |
|
|
252 |
|
|
|
253 |
# Filter spatial experiment object to only include sample ID |
|
|
254 |
sample_id <- "151673" |
|
|
255 |
spe_sub <- spe[, spe$sample_id == sample_id] |
|
|
256 |
|
|
|
257 |
# Plot image and add QC metrics to spe object |
|
|
258 |
img <- plot_image(spe_sub) |
|
|
259 |
spe_sub <- add_QC_metrics(spe_sub) |
|
|
260 |
|
|
|
261 |
# Perform QC operations |
|
|
262 |
spe_sub <- threshold_lib_size(spe_sub, 800) |
|
|
263 |
spe_sub <- threshold_expressed_genes(spe_sub, 550) |
|
|
264 |
spe_sub <- threshold_mito_reads(spe_sub, 28) |
|
|
265 |
spe_sub <- threshold_num_cells(spe_sub, 18) |
|
|
266 |
|
|
|
267 |
# Discard low quality spots determined by QC operations |
|
|
268 |
spe_sub <- discard_spots(spe_sub) |
|
|
269 |
spe_sub <- discard_uninformative_genes(spe_sub) |
|
|
270 |
``` |