# st-histology-ml

## Project Description
The goal of this project is to develop a machine learning model to predict spatially localized gene expression from histology images.

The field of barcode-based spatially resolved transcriptomics (SRT) uses spatially barcoded mRNA and next-generation sequencing technology to recover gene expression levels at specific loci (gene expression spots) within a tissue sample. In addition, haematoxylin and eosin (H&E)-stained images provide a visual representation of these tissue sections and can be leveraged as additional information when analyzing these data.

Now that spatially localized gene expression data and matched histology images are available, one goal is to predict the expression levels at specific tissue loci directly from a histology image of the sample, rather than measuring them by sequencing. Predicted expression could then be obtained for future H&E images that have no corresponding expression measurements, providing spatial gene expression without the cost of measuring it directly. This project develops a machine learning model for this task: features extracted from the H&E image are the inputs, and the output is the predicted expression level of a specific gene at each image location.

## Data
Data for this project were retrieved from the spatialLIBD R/Bioconductor package (http://spatial.libd.org/spatialLIBD/). The package includes 12 low-resolution (600x600 pixels) LIBD human dorsolateral prefrontal cortex (DLPFC) spatial transcriptomics samples generated with the 10x Genomics Visium platform. Each sample has an associated histology image and a matrix of log-normalized counts (logcounts) for the barcoded spots.

## Feature Extraction
The most basic input features were pixel values in the RGB, grayscale, and HSL (hue, saturation, lightness) color spaces. To develop more advanced features, a 13x13 pixel image patch centered on each barcoded spot was passed through an autoencoder, and the resulting latent-space features were used as inputs for classification.
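The basic color features and the patch extraction described above can be sketched as follows. The project's notebooks are written in R; this sketch uses plain Python only for illustration. The function names (`extract_patch`, `color_features`, `flatten_patch`), the image layout (rows of `(r, g, b)` tuples in 0-255), the grayscale weights, and the clamping at image borders are all assumptions, not the project's actual code. The autoencoder itself is omitted.

```python
import colorsys

PATCH_SIZE = 13  # 13x13 pixel patch, as described in the text


def extract_patch(image, row, col, size=PATCH_SIZE):
    """Return a size x size patch centered on (row, col).

    `image` is a list of rows, each a list of (r, g, b) tuples in 0-255.
    Coordinates outside the image are clamped to the nearest edge pixel
    (an assumption; the original border handling is not specified).
    """
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), height - 1)
        patch_row = []
        for c in range(col - half, col + half + 1):
            cc = min(max(c, 0), width - 1)
            patch_row.append(image[rr][cc])
        patch.append(patch_row)
    return patch


def color_features(pixel):
    """Return RGB, grayscale, and HSL features for one pixel, scaled to [0, 1]."""
    r, g, b = (v / 255.0 for v in pixel)
    gray = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 luma weights (assumed)
    h, l, s = colorsys.rgb_to_hls(r, g, b)    # note: colorsys returns H, L, S order
    return {"r": r, "g": g, "b": b, "gray": gray, "h": h, "s": s, "l": l}


def flatten_patch(patch):
    """Flatten a patch into one vector of channel values in [0, 1]."""
    return [v / 255.0 for patch_row in patch for pixel in patch_row for v in pixel]


# Toy 20x20 image (made up): a uniform pink background with one dark purple pixel.
image = [[(230, 180, 200)] * 20 for _ in range(20)]
image[10][10] = (90, 40, 120)

patch = extract_patch(image, 10, 10)
center = color_features(patch[PATCH_SIZE // 2][PATCH_SIZE // 2])
vector = flatten_patch(patch)
```

Here `vector` has 13 x 13 x 3 = 507 entries, which is the kind of flattened input an autoencoder would compress into a smaller latent representation.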
## Classification
The baseline model performed binary classification for a small set of genes whose expression was strongly correlated with the input features. Target classes were 0 (the gene is not expressed at a given spot) and 1 (the gene is expressed at that spot). Features were standardized, and a logistic regression model performed the prediction.
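To make the standardize-then-classify step concrete, here is a minimal sketch in plain Python (the project's own notebooks are in R). It fits logistic regression by batch gradient descent, which is a stand-in chosen for brevity and not necessarily how the project fitted its model. All function names, the learning rate, the number of epochs, and the toy data are assumptions made for illustration.

```python
import math


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in rows]
    return scaled, means, stds


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(rows, weights, bias, threshold=0.5):
    """Return class 1 where the predicted probability reaches the threshold."""
    return [
        int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= threshold)
        for x in rows
    ]


# Toy data (made up): two color features per spot, and whether the gene is expressed.
features = [[0.80, 0.20], [0.75, 0.25], [0.85, 0.15],
            [0.30, 0.60], [0.35, 0.70], [0.25, 0.65]]
labels = [0, 0, 0, 1, 1, 1]

scaled, means, stds = standardize(features)
weights, bias = fit_logistic(scaled, labels)
predictions = predict(scaled, weights, bias)
```

When applying such a model to new spots, the `means` and `stds` computed on the training data should be reused rather than recomputed, so that training and prediction features are on the same scale.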
## File Structure
| <br />
| _ Preprocessing.Rmd <br />
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; R Markdown notebook containing quality control operations to remove uninformative genes and spots. <br />
| <br />
| _ EDA.Rmd <br />
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; R Markdown notebook containing exploratory data analysis, including spot plots, feature extraction methods, and correlation analysis. <br />
| <br />
| _ Models.Rmd <br />
| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; R Markdown notebook performing binary classification of select genes found to be highly correlated with the input features. <br />