a b/README.Rmd
1
---
2
output: github_document
3
---
4
5
```{r, include = FALSE}
6
knitr::opts_chunk$set(
7
  collapse = TRUE,
8
  comment = "#>"
9
)
10
```
11
12
# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" />
13
14
Multiple diseases can present with similar initial symptoms, making it
15
difficult to clinically differentiate between these conditions. GPROB uses
16
patients' genetic information to help prioritize a diagnosis. This genetic
17
diagnostic tool can be applied to any situation with phenotypically similar
18
diseases with different underlying genetics.
19
20
<!-- badges: start -->
21
[![R build status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
22
<!-- badges: end -->
23
24
25
## Citation
26
27
Please cite:
28
29
- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology
30
  outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020)
31
32
[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548
33
34
35
## License
36
37
Please see the [LICENSE] file for details. [Contact us] for other licensing
38
options.
39
40
[LICENSE]: LICENSE
41
[Contact us]: mailto:soumya@broadinstitute.org
42
43
44
## Installation
45
46
Install and load the GPROB R package.
47
48
```{r, eval = FALSE}
49
devtools::install_github("immunogenomics/GPROB")
50
library(GPROB)
51
```
52
53
## Synopsis
54
55
GPROB estimates the probability that each individual has a given phenotype.
56
57
We need three inputs:
58
59
- Population prevalences of the phenotypes of interest.
60
61
- Odds ratios for SNP associations with the phenotypes.
62
63
- SNP genotypes (0, 1, 2) for each individual.
64
65
### Example
66
67
Let's use a small example with artificial data to learn how to use GPROB.
68
69
Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms
70
(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus
71
erythematosus (SLE).
72
73
#### Prevalence
74
75
First, we should find out the prevalence of RA and SLE in the population that
76
is representative of our patients.
77
78
```{r}
79
prevalence <- c("RA" = 0.001, "SLE" = 0.001)
80
```
81
82
#### Odds Ratios
83
84
Next, we need to obtain the odds ratios (ORs) from published genome-wide
85
association studies (GWAS). We should be careful to note which alleles are
86
associated with the phenotype to compute the risk in the correct direction.
87
88
```{r}
89
or <- read.delim(
90
  sep = "",
91
  row.names = 1,
92
  text = "
93
snp  RA SLE
94
SNP1 1.0 0.4
95
SNP2 1.0 0.9
96
SNP3 1.0 1.3
97
SNP4 0.4 1.6
98
SNP5 0.9 1.0
99
SNP6 1.3 1.0
100
SNP7 1.6 1.0
101
")
102
or <- as.matrix(or)
103
```
104
105
#### Genotypes
106
107
Finally, we need the genotype data for each of our 10 patients. Here, the data
108
is coded in the form (0, 1, 2) to indicate the number of copies of the risk
109
allele.
110
111
```{r}
112
geno <- read.delim(
113
  sep = "",
114
  row.names = 1,
115
  text = "
116
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
117
 1    0    1    0    2    1    0
118
 2    0    0    1    0    2    2
119
 3    1    0    1    1    0    2
120
 4    1    1    0    2    0    0
121
 5    0    1    1    1    1    0
122
 6    0    0    1    3    0    2
123
 7    2    2    2    2    2    2
124
 8    1    2    0    2    1    1
125
 9    0    2    1   NA    1    2
126
10    1    0    2    2    2    0
127
")
128
geno <- as.matrix(geno)
129
```
130
131
#### Dealing with missing or invalid data
132
133
Before we run the `GPROB()` function, we need to deal with invalid and missing
134
data.
135
136
We remove individuals who have `NA` for any SNP:
137
138
```{r}
139
ix <- apply(geno, 1, function(x) !any(is.na(x)))
140
geno <- geno[ix,]
141
```
142
143
We remove individuals who have invalid allele counts:
144
145
```{r}
146
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
147
geno <- geno[ix,]
148
```
149
150
And we make sure that we use the same SNPs in the `or` and `geno` matrices:
151
152
```{r}
153
or <- or[colnames(geno),]
154
```
155
156
#### Run GPROB
157
158
Then we can run the GPROB function to estimate probabilities:
159
160
```{r}
161
library(GPROB)
162
res <- GPROB(prevalence, or, geno)
163
res
164
```
165
166
In this example, we might interpret the numbers as follows:
167
168
- Individual 2 has RA with probability 0.003, given individual genetic risk
169
  factors, disease prevalence, and the number of patients used in genetic risk
170
  score calculations.
171
172
- Individual 2 has RA with probability 0.77, conditional on the additional
173
  assumption that individual 2 has either RA or SLE.
174
175
## Calculations, step by step
176
177
Let's go through each step of GPROB to understand how how it works.
178
179
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as:
180
181
<p align="center">
182
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
183
</p>
184
185
where:
186
187
- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i*
188
189
- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide
190
  association study (GWAS) for disease *k*
191
192
<table><tr><td>
193
<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g.
194
0.5) to correct for possible overestimation of the effect sizes due to
195
publication bias. In other words, consider running <code>geno <- 0.5 *
196
geno</code>.
197
</td></tr></table>
198
199
```{r}
200
risk <- geno %*% log(or)
201
risk
202
```
203
204
The known prevalence <i>V<sub>k</sub></i> of each disease in the general population:
205
206
```{r}
207
prevalence
208
```
209
210
We can calculate the population level probability <i>P<sub>ki</sub></i> that each individual
211
has the disease.
212
213
<p align="center">
214
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/>
215
</p>
216
217
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
218
<i>(P&#773;<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the
219
mean probability <i>P&#773;<sub>k</sub></i> across individuals is equal to the
220
known prevalence <i>V<sub>k</sub></i> of the disease in the population.
221
222
```{r}
223
# @param alpha A constant that we choose manually.
224
# @param risk A vector of risk scores for individuals.
225
# @returns A vector of probabilities for each individual.
226
prob <- function(alpha, risk) {
227
  1 / (
228
    1 + exp(alpha - risk)
229
  )
230
}
231
alpha <- sapply(seq(ncol(risk)), function(i) {
232
  o <- optimize(
233
    f        = function(alpha, risk, prevalence) {
234
      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
235
    },
236
    interval = c(-100, 100),
237
    risk = risk[,i],
238
    prevalence = prevalence[i]
239
  )
240
  o$minimum
241
})
242
alpha
243
```
244
245
Now that we have computed alpha, we can compute the population-level
246
probabilities of disease for each individual.
247
248
```{r}
249
# population-level disease probability
250
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
251
p
252
```
253
254
Next we assume that each individual has one of the diseases:
255
256
<p align="center">
257
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
258
</p>
259
260
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each
261
disease *k*: 
262
263
<p align="center">
264
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
265
</p>
266
267
```{r}
268
# patient-level disease probability
269
cp <- p / rowSums(p)
270
cp
271
```
272