Diff of /README.md [d9f92b] .. [4ed68b]

--- a/README.md
+++ b/README.md
+---
-# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" />
+output: github_document
+---
-Multiple diseases can present with similar initial symptoms, making it
-difficult to clinically differentiate between these conditions. GPROB
+```{r, include = FALSE}
-uses patients’ genetic information to help prioritize a diagnosis. This
+knitr::opts_chunk$set(
-genetic diagnostic tool can be applied to any situation with
+  collapse = TRUE,
-phenotypically similar diseases with different underlying genetics.
+  comment = "#>"
+)
-<!-- badges: start -->
+```
-[![R build
+# GPROB
-status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
-<!-- badges: end -->
+Multiple diseases can present with similar initial symptoms, making it
+difficult to clinically differentiate between these conditions. GPROB uses
-## Citation
+patients' genetic information to help prioritize a diagnosis. This genetic
+diagnostic tool can be applied to any situation with phenotypically similar
-Please cite:
+diseases with different underlying genetics.
--   Knevel, R. et al. [Using genetics to prioritize diagnoses for
+<!-- badges: start -->
-    rheumatology outpatients with inflammatory
+[![R build status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
-    arthritis.](http://dx.doi.org/10.1126/scitranslmed.aay1548) Sci.
+<!-- badges: end -->
-    Transl. Med. 12, (2020)
-## License
+## Citation
-Please see the [LICENSE](LICENSE) file for details. [Contact
+Please cite:
-us](mailto:soumya@broadinstitute.org) for other licensing options.
+- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology
-## Installation
+  outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020)
-Install and load the GPROB R package.
+[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548
-``` r
-devtools::install_github("immunogenomics/GPROB")
+## License
-library(GPROB)
-```
+Please see the [LICENSE] file for details. [Contact us] for other licensing
+options.
-## Synopsis
+[LICENSE]: LICENSE
-GPROB estimates the probability that each individual has a given
+[Contact us]: mailto:soumya@broadinstitute.org
-phenotype.
-We need three inputs:
+## Installation
--   Population prevalences of the phenotypes of interest.
+Install and load the GPROB R package.
--   Odds ratios for SNP associations with the phenotypes.
+```{r, eval = FALSE}
+devtools::install_github("immunogenomics/GPROB")
--   SNP genotypes (0, 1, 2) for each individual.
+library(GPROB)
+```
-### Example
+## Synopsis
-Let’s use a small example with artificial data to learn how to use
-GPROB.
+GPROB estimates the probability that each individual has a given phenotype.
-Suppose we have 10 patients, and we know of 7 single nucleotide
+We need three inputs:
-polymorphisms (SNPs) associated with rheumatoid arthritis (RA) or
-systemic lupus erythematosus (SLE).
+- Population prevalences of the phenotypes of interest.
-#### Prevalence
+- Odds ratios for SNP associations with the phenotypes.
-First, we should find out the prevalence of RA and SLE in the population
+- SNP genotypes (0, 1, 2) for each individual.
-that is representative of our patients.
+### Example
-``` r
-prevalence <- c("RA" = 0.001, "SLE" = 0.001)
+Let's use a small example with artificial data to learn how to use GPROB.
-```
+Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms
-#### Odds Ratios
+(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus
+erythematosus (SLE).
-Next, we need to obtain the odds ratios (ORs) from published genome-wide
-association studies (GWAS). We should be careful to note which alleles
+#### Prevalence
-are associated with the phenotype to compute the risk in the correct
-direction.
+First, we should find out the prevalence of RA and SLE in the population that
+is representative of our patients.
-``` r
-or <- read.delim(
+```{r}
-  sep = "",
+prevalence <- c("RA" = 0.001, "SLE" = 0.001)
-  row.names = 1,
+```
-  text = "
-snp  RA SLE
+#### Odds Ratios
-SNP1 1.0 0.4
-SNP2 1.0 0.9
+Next, we need to obtain the odds ratios (ORs) from published genome-wide
-SNP3 1.0 1.3
+association studies (GWAS). We should be careful to note which alleles are
-SNP4 0.4 1.6
+associated with the phenotype to compute the risk in the correct direction.
-SNP5 0.9 1.0
-SNP6 1.3 1.0
+```{r}
-SNP7 1.6 1.0
+or <- read.delim(
-")
+  sep = "",
-or <- as.matrix(or)
+  row.names = 1,
-```
+  text = "
+snp  RA SLE
-#### Genotypes
+SNP1 1.0 0.4
+SNP2 1.0 0.9
-Finally, we need the genotype data for each of our 10 patients. Here,
+SNP3 1.0 1.3
-the data is coded in the form (0, 1, 2) to indicate the number of copies
+SNP4 0.4 1.6
-of the risk allele.
+SNP5 0.9 1.0
+SNP6 1.3 1.0
-``` r
+SNP7 1.6 1.0
-geno <- read.delim(
+")
-  sep = "",
+or <- as.matrix(or)
-  row.names = 1,
+```
-  text = "
-id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
+#### Genotypes
-1     0    1    0    2    1    0
-2     0    0    1    0    2    2
+Finally, we need the genotype data for each of our 10 patients. Here, the data
-3     1    0    1    1    0    2
+is coded in the form (0, 1, 2) to indicate the number of copies of the risk
-4     1    1    0    2    0    0
+allele.
-5     0    1    1    1    1    0
-6     0    0    1    3    0    2
+```{r}
-7     2    2    2    2    2    2
+geno <- read.delim(
-8     1    2    0    2    1    1
+  sep = "",
-9     0    2    1   NA    1    2
+  row.names = 1,
-10    1    0    2    2    2    0
+  text = "
-")
+id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
-geno <- as.matrix(geno)
+1     0    1    0    2    1    0
-```
+2     0    0    1    0    2    2
+3     1    0    1    1    0    2
-#### Dealing with missing or invalid data
+4     1    1    0    2    0    0
+5     0    1    1    1    1    0
-Before we run the `GPROB()` function, we need to deal with invalid and
+6     0    0    1    3    0    2
-missing data.
+7     2    2    2    2    2    2
+8     1    2    0    2    1    1
-We remove individuals who have `NA` for any SNP:
+9     0    2    1   NA    1    2
+10    1    0    2    2    2    0
-``` r
+")
-ix <- apply(geno, 1, function(x) !any(is.na(x)))
+geno <- as.matrix(geno)
-geno <- geno[ix,]
+```
-```
+#### Dealing with missing or invalid data
-We remove individuals who have invalid allele counts:
+Before we run the `GPROB()` function, we need to deal with invalid and missing
-``` r
+data.
-ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
-geno <- geno[ix,]
+We remove individuals who have `NA` for any SNP:
-```
+```{r}
-And we make sure that we use the same SNPs in the `or` and `geno`
+ix <- apply(geno, 1, function(x) !any(is.na(x)))
-matrices:
+geno <- geno[ix,]
+```
-``` r
-or <- or[colnames(geno),]
+We remove individuals who have invalid allele counts:
-```
+```{r}
-#### Run GPROB
+ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
+geno <- geno[ix,]
-Then we can run the GPROB function to estimate probabilities:
+```
-``` r
+And we make sure that we use the same SNPs in the `or` and `geno` matrices:
-library(GPROB)
-res <- GPROB(prevalence, or, geno)
+```{r}
-res
+or <- or[colnames(geno),]
-#> $pop_prob
+```
-#>              RA          SLE
-#> 1  0.0003556116 0.0017797758
+#### Run GPROB
-#> 2  0.0033703376 0.0010049932
-#> 3  0.0016672084 0.0006434285
+Then we can run the GPROB function to estimate probabilities:
-#> 4  0.0003951084 0.0007126714
-#> 5  0.0008885550 0.0014465506
+```{r}
-#> 7  0.0005407850 0.0004337103
+library(GPROB)
-#> 8  0.0004622457 0.0006414499
+res <- GPROB(prevalence, or, geno)
-#> 10 0.0003200618 0.0013374018
+res
-#>
+```
-#> $cond_prob
-#>           RA       SLE
+In this example, we might interpret the numbers as follows:
-#> 1  0.1665326 0.8334674
-#> 2  0.7703046 0.2296954
+- Individual 2 has RA with probability 0.003, given individual genetic risk
-#> 3  0.7215363 0.2784637
+  factors, disease prevalence, and the number of patients used in genetic risk
-#> 4  0.3566669 0.6433331
+  score calculations.
-#> 5  0.3805203 0.6194797
-#> 7  0.5549386 0.4450614
+- Individual 2 has RA with probability 0.77, conditional on the additional
-#> 8  0.4188163 0.5811837
+  assumption that individual 2 has either RA or SLE.
-#> 10 0.1931034 0.8068966
-```
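As a quick check on the interpretation of individual 2, the conditional probability can be recomputed by hand from the two population-level values printed in `$pop_prob`. This is only a worked arithmetic sketch; the names `p2` and `cp2` are illustrative and do not come from the package.

```r
# Population-level probabilities for individual 2, copied from the printed output.
p2 <- c(RA = 0.0033703376, SLE = 0.0010049932)

# Assuming the individual has exactly one of the two diseases,
# the probabilities are renormalized so that they sum to 1.
cp2 <- p2 / sum(p2)
round(cp2, 2)
```

The result is roughly 0.77 for RA and 0.23 for SLE, which agrees with the `$cond_prob` row for individual 2.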
+## Calculations, step by step
-In this example, we might interpret the numbers as follows:
+Let's go through each step of GPROB to understand how it works.
--   Individual 2 has RA with probability 0.003, given individual genetic
+The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as:
-    risk factors, disease prevalence, and the number of patients used in
-    genetic risk score calculations.
+<p align="center">
+<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
--   Individual 2 has RA with probability 0.77, conditional on the
+</p>
-    additional assumption that individual 2 has either RA or SLE.
+where:
-## Calculations, step by step
+- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i*
-Let’s go through each step of GPROB to understand how it works.
+- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide
-The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for
+  association study (GWAS) for disease *k*
-disease *k* is defined as:
+<table><tr><td>
-<p align="center">
+<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g.
-<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
+0.5) to correct for possible overestimation of the effect sizes due to
-</p>
+publication bias. In other words, consider running <code>geno <- 0.5 *
+geno</code>.
-where:
+</td></tr></table>
--   <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in
+```{r}
-    individual *i*
+risk <- geno %*% log(or)
+risk
--   <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in
+```
-    a genome-wide association study (GWAS) for disease *k*
+The known prevalence <i>V<sub>k</sub></i> of each disease in the general population:
-<table>
-<tr>
+```{r}
-<td>
+prevalence
-<b>Note:</b> We might want to consider shrinking the risk by some factor
+```
-(e.g. 0.5) to correct for possible overestimation of the effect sizes
-due to publication bias. In other words, consider running <code>geno \<-
+We can calculate the population-level probability <i>P<sub>ki</sub></i> that each individual
-.5 \* geno</code>.
+has the disease.
-</td>
-</tr>
+<p align="center">
-</table>
+<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(\alpha_k-S_{ki})}}"/>
+</p>
-``` r
-risk <- geno %*% log(or)
+We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
-risk
+<i>(P&#773;<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the
-#>            RA         SLE
+mean probability <i>P&#773;<sub>k</sub></i> across individuals is equal to the
-#> 1  -1.9379420  0.83464674
+known prevalence <i>V<sub>k</sub></i> of the disease in the population.
-#> 2   0.3140075  0.26236426
-#> 3  -0.3915622 -0.18392284
+```{r}
-#> 4  -1.8325815 -0.08164399
+# @param alpha An offset, fit below so that the mean probability matches the prevalence.
-#> 5  -1.0216512  0.62700738
+# @param risk A vector of risk scores for individuals.
-#> 7  -1.5185740 -0.57856671
+# @returns A vector of probabilities for each individual.
-#> 8  -1.6755777 -0.18700450
+prob <- function(alpha, risk) {
-#> 10 -2.0433025  0.54844506
+  1 / (
-```
+    1 + exp(alpha - risk)
+  )
-The known prevalence <i>V<sub>k</sub></i> of each disease in the general
+}
-population:
+alpha <- sapply(seq(ncol(risk)), function(i) {
+  o <- optimize(
-``` r
+    f        = function(alpha, risk, prevalence) {
-prevalence
+      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
-#>    RA   SLE
+    },
-#> 0.001 0.001
+    interval = c(-100, 100),
-```
+    risk = risk[,i],
+    prevalence = prevalence[i]
-We can calculate the population level probability <i>P<sub>ki</sub></i>
+  )
-that each individual has the disease.
+  o$minimum
+})
-<p align="center">
+alpha
-<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(\alpha_k-S_{ki})}}"/>
+```
-</p>
+Now that we have computed alpha, we can compute the population-level
-We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
+probabilities of disease for each individual.
-<i>(P̅<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that
-the mean probability <i>P̅<sub>k</sub></i> across individuals is equal to
+```{r}
-the known prevalence <i>V<sub>k</sub></i> of the disease in the
+# population-level disease probability
-population.
+p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
+p
-``` r
+```
-# @param alpha A constant that we choose manually.
-# @param risk A vector of risk scores for individuals.
+Next we assume that each individual has one of the diseases:
-# @returns A vector of probabilities for each individual.
-prob <- function(alpha, risk) {
+<p align="center">
-  1 / (
+<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
-    1 + exp(alpha - risk)
+</p>
+Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each
-alpha <- sapply(seq(ncol(risk)), function(i) {
+disease *k*:
-  o <- optimize(
-    f        = function(alpha, risk, prevalence) {
+<p align="center">
-      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
+<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
-    },
+</p>
-    interval = c(-100, 100),
-    risk = risk[,i],
+```{r}
-    prevalence = prevalence[i]
+# patient-level disease probability
+cp <- p / rowSums(p)
-  o$minimum
+cp
-})
+```
-alpha
-#> [1] 6.003374 7.164133
-```
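To see what the fitted alpha does, the calibration can be checked by hand for RA. The sketch below re-declares `prob()` with the same logistic form used in the walkthrough and plugs in the RA risk scores and the RA alpha from the printed output; `risk_ra`, `alpha_ra`, and `mean_p_ra` are illustrative names, and the numbers are copied by hand, so treat this as a sanity check rather than part of the package.

```r
# Logistic link from the walkthrough: larger risk scores give larger probabilities.
prob <- function(alpha, risk) 1 / (1 + exp(alpha - risk))

# RA risk scores for the 8 retained individuals, copied from the `risk` output.
risk_ra <- c(-1.9379420, 0.3140075, -0.3915622, -1.8325815,
             -1.0216512, -1.5185740, -1.6755777, -2.0433025)

# Fitted alpha for RA, copied from the `alpha` output.
alpha_ra <- 6.003374

# The mean probability should land very close to the RA prevalence of 0.001.
mean_p_ra <- mean(prob(alpha_ra, risk_ra))
mean_p_ra
```

Because alpha is found numerically with `optimize()`, the match to the prevalence is close but not exact.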
-Now that we have computed alpha, we can compute the population-level
-probabilities of disease for each individual.
-``` r
-# population-level disease probability
-p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
-#>            [,1]         [,2]
-#> 1  0.0003556116 0.0017797758
-#> 2  0.0033703376 0.0010049932
-#> 3  0.0016672084 0.0006434285
-#> 4  0.0003951084 0.0007126714
-#> 5  0.0008885550 0.0014465506
-#> 7  0.0005407850 0.0004337103
-#> 8  0.0004622457 0.0006414499
-#> 10 0.0003200618 0.0013374018
-```
-Next we assume that each individual has one of the diseases:
-<p align="center">
-<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
-</p>
-Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of
-each disease *k*:
-<p align="center">
-<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
-</p>
-``` r
-# patient-level disease probability
-cp <- p / rowSums(p)
-cp
-#>         [,1]      [,2]
-#> 1  0.1665326 0.8334674
-#> 2  0.7703046 0.2296954
-#> 3  0.7215363 0.2784637
-#> 4  0.3566669 0.6433331
-#> 5  0.3805203 0.6194797
-#> 7  0.5549386 0.4450614
-#> 8  0.4188163 0.5811837
-#> 10 0.1931034 0.8068966
-```
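The steps above can be collected into one function. This is a minimal sketch that assumes the walkthrough describes the whole calculation; `gprob_sketch()` is a hypothetical helper written for illustration, not the package's own `GPROB()` implementation. The input matrices repeat the example data, keeping only the 8 individuals that survive the missing and invalid data filters.

```r
# Hypothetical helper (not part of the GPROB package): repeats the
# step-by-step calculation on plain matrices.
gprob_sketch <- function(prevalence, or, geno) {
  # Keep only the SNPs that were genotyped, in matching order.
  or <- or[colnames(geno), , drop = FALSE]
  # Genetic risk score: allele counts times log odds ratios.
  risk <- geno %*% log(or)
  # Logistic link used in the walkthrough.
  prob <- function(alpha, risk) 1 / (1 + exp(alpha - risk))
  # One alpha per disease, chosen so the mean probability matches the prevalence.
  alpha <- sapply(seq_len(ncol(risk)), function(k) {
    optimize(
      f = function(a) (mean(prob(a, risk[, k])) - prevalence[k])^2,
      interval = c(-100, 100)
    )$minimum
  })
  # Population-level probabilities, then renormalize within each individual.
  p <- sapply(seq_along(alpha), function(k) prob(alpha[k], risk[, k]))
  dimnames(p) <- dimnames(risk)
  list(pop_prob = p, cond_prob = p / rowSums(p))
}

prevalence <- c(RA = 0.001, SLE = 0.001)
or <- cbind(
  RA  = c(1.0, 1.0, 1.0, 0.4, 0.9, 1.3, 1.6),
  SLE = c(0.4, 0.9, 1.3, 1.6, 1.0, 1.0, 1.0)
)
rownames(or) <- paste0("SNP", 1:7)
geno <- matrix(
  c(0, 1, 0, 2, 1, 0,
    0, 0, 1, 0, 2, 2,
    1, 0, 1, 1, 0, 2,
    1, 1, 0, 2, 0, 0,
    0, 1, 1, 1, 1, 0,
    2, 2, 2, 2, 2, 2,
    1, 2, 0, 2, 1, 1,
    1, 0, 2, 2, 2, 0),
  ncol = 6, byrow = TRUE,
  dimnames = list(c("1", "2", "3", "4", "5", "7", "8", "10"),
                  paste0("SNP", 1:6))
)

res_sketch <- gprob_sketch(prevalence, or, geno)
res_sketch$cond_prob
```

The conditional probabilities should closely track the `cp` values printed in the walkthrough, since the sketch follows the same steps on the same data.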