Switch to unified view

a/README.md b/README.md
1
1
---
2
# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" />
2
output: github_document
3
3
---
4
Multiple diseases can present with similar initial symptoms, making it
4
5
difficult to clinically differentiate between these conditions. GPROB
5
```{r, include = FALSE}
6
uses patients’ genetic information to help prioritize a diagnosis. This
6
knitr::opts_chunk$set(
7
genetic diagnostic tool can be applied to any situation with
7
  collapse = TRUE,
8
phenotypically similar diseases with different underlying genetics.
8
  comment = "#>"
9
9
)
10
<!-- badges: start -->
10
```
11
11
12
[![R build
12
# GPROB
13
status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
13
14
<!-- badges: end -->
14
Multiple diseases can present with similar initial symptoms, making it
15
15
difficult to clinically differentiate between these conditions. GPROB uses
16
## Citation
16
patients' genetic information to help prioritize a diagnosis. This genetic
17
17
diagnostic tool can be applied to any situation with phenotypically similar
18
Please cite:
18
diseases with different underlying genetics.
19
19
20
-   Knevel, R. et al. [Using genetics to prioritize diagnoses for
20
<!-- badges: start -->
21
    rheumatology outpatients with inflammatory
21
[![R build status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
22
    arthritis.](http://dx.doi.org/10.1126/scitranslmed.aay1548) Sci.
22
<!-- badges: end -->
23
    Transl. Med. 12, (2020)
23
24
24
25
## License
25
## Citation
26
26
27
Please see the [LICENSE](LICENSE) file for details. [Contact
27
Please cite:
28
us](mailto:soumya@broadinstitute.org) for other licensing options.
28
29
29
- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology
30
## Installation
30
  outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020)
31
31
32
Install and load the GPROB R package.
32
[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548
33
33
34
``` r
34
35
devtools::install_github("immunogenomics/GPROB")
35
## License
36
library(GPROB)
36
37
```
37
Please see the [LICENSE] file for details. [Contact us] for other licensing
38
38
options.
39
## Synopsis
39
40
40
[LICENSE]: LICENSE
41
GPROB estimates the probability that each individual has a given
41
[Contact us]: mailto:soumya@broadinstitute.org
42
phenotype.
42
43
43
44
We need three inputs:
44
## Installation
45
45
46
-   Population prevalences of the phenotypes of interest.
46
Install and load the GPROB R package.
47
47
48
-   Odds ratios for SNP associations with the phenotypes.
48
```{r, eval = FALSE}
49
49
devtools::install_github("immunogenomics/GPROB")
50
-   SNP genotypes (0, 1, 2) for each individual.
50
library(GPROB)
51
51
```
52
### Example
52
53
53
## Synopsis
54
Let’s use a small example with artificial data to learn how to use
54
55
GPROB.
55
GPROB estimates the probability that each individual has a given phenotype.
56
56
57
Suppose we have 10 patients, and we know of 7 single nucleotide
57
We need three inputs:
58
polymorphisms (SNPs) associated with rheumatoid arthritis (RA) or
58
59
systemic lupus erythematosus (SLE).
59
- Population prevalences of the phenotypes of interest.
60
60
61
#### Prevalence
61
- Odds ratios for SNP associations with the phenotypes.
62
62
63
First, we should find out the prevalence of RA and SLE in the population
63
- SNP genotypes (0, 1, 2) for each individual.
64
that is representative of our patients.
64
65
65
### Example
66
``` r
66
67
prevalence <- c("RA" = 0.001, "SLE" = 0.001)
67
Let's use a small example with artificial data to learn how to use GPROB.
68
```
68
69
69
Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms
70
#### Odds Ratios
70
(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus
71
71
erythematosus (SLE).
72
Next, we need to obtain the odds ratios (ORs) from published genome-wide
72
73
association studies (GWAS). We should be careful to note which alleles
73
#### Prevalence
74
are associated with the phenotype to compute the risk in the correct
74
75
direction.
75
First, we should find out the prevalence of RA and SLE in the population that
76
76
is representative of our patients.
77
``` r
77
78
or <- read.delim(
78
```{r}
79
  sep = "",
79
prevalence <- c("RA" = 0.001, "SLE" = 0.001)
80
  row.names = 1,
80
```
81
  text = "
81
82
snp  RA SLE
82
#### Odds Ratios
83
SNP1 1.0 0.4
83
84
SNP2 1.0 0.9
84
Next, we need to obtain the odds ratios (ORs) from published genome-wide
85
SNP3 1.0 1.3
85
association studies (GWAS). We should be careful to note which alleles are
86
SNP4 0.4 1.6
86
associated with the phenotype to compute the risk in the correct direction.
87
SNP5 0.9 1.0
87
88
SNP6 1.3 1.0
88
```{r}
89
SNP7 1.6 1.0
89
or <- read.delim(
90
")
90
  sep = "",
91
or <- as.matrix(or)
91
  row.names = 1,
92
```
92
  text = "
93
93
snp  RA SLE
94
#### Genotypes
94
SNP1 1.0 0.4
95
95
SNP2 1.0 0.9
96
Finally, we need the genotype data for each of our 10 patients. Here,
96
SNP3 1.0 1.3
97
the data is coded in the form (0, 1, 2) to indicate the number of copies
97
SNP4 0.4 1.6
98
of the risk allele.
98
SNP5 0.9 1.0
99
99
SNP6 1.3 1.0
100
``` r
100
SNP7 1.6 1.0
101
geno <- read.delim(
101
")
102
  sep = "",
102
or <- as.matrix(or)
103
  row.names = 1,
103
```
104
  text = "
104
105
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
105
#### Genotypes
106
 1    0    1    0    2    1    0
106
107
 2    0    0    1    0    2    2
107
Finally, we need the genotype data for each of our 10 patients. Here, the data
108
 3    1    0    1    1    0    2
108
is coded in the form (0, 1, 2) to indicate the number of copies of the risk
109
 4    1    1    0    2    0    0
109
allele.
110
 5    0    1    1    1    1    0
110
111
 6    0    0    1    3    0    2
111
```{r}
112
 7    2    2    2    2    2    2
112
geno <- read.delim(
113
 8    1    2    0    2    1    1
113
  sep = "",
114
 9    0    2    1   NA    1    2
114
  row.names = 1,
115
10    1    0    2    2    2    0
115
  text = "
116
")
116
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
117
geno <- as.matrix(geno)
117
 1    0    1    0    2    1    0
118
```
118
 2    0    0    1    0    2    2
119
119
 3    1    0    1    1    0    2
120
#### Dealing with missing or invalid data
120
 4    1    1    0    2    0    0
121
121
 5    0    1    1    1    1    0
122
Before we run the `GPROB()` function, we need to deal with invalid and
122
 6    0    0    1    3    0    2
123
missing data.
123
 7    2    2    2    2    2    2
124
124
 8    1    2    0    2    1    1
125
We remove individuals who have `NA` for any SNP:
125
 9    0    2    1   NA    1    2
126
126
10    1    0    2    2    2    0
127
``` r
127
")
128
ix <- apply(geno, 1, function(x) !any(is.na(x)))
128
geno <- as.matrix(geno)
129
geno <- geno[ix,]
129
```
130
```
130
131
131
#### Dealing with missing or invalid data
132
We remove individuals who have invalid allele counts:
132
133
133
Before we run the `GPROB()` function, we need to deal with invalid and missing
134
``` r
134
data.
135
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
135
136
geno <- geno[ix,]
136
We remove individuals who have `NA` for any SNP:
137
```
137
138
138
```{r}
139
And we make sure that we use the same SNPs in the `or` and `geno`
139
ix <- apply(geno, 1, function(x) !any(is.na(x)))
140
matrices:
140
geno <- geno[ix,]
141
141
```
142
``` r
142
143
or <- or[colnames(geno),]
143
We remove individuals who have invalid allele counts:
144
```
144
145
145
```{r}
146
#### Run GPROB
146
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
147
147
geno <- geno[ix,]
148
Then we can run the GPROB function to estimate probabilities:
148
```
149
149
150
``` r
150
And we make sure that we use the same SNPs in the `or` and `geno` matrices:
151
library(GPROB)
151
152
res <- GPROB(prevalence, or, geno)
152
```{r}
153
res
153
or <- or[colnames(geno),]
154
#> $pop_prob
154
```
155
#>              RA          SLE
155
156
#> 1  0.0003556116 0.0017797758
156
#### Run GPROB
157
#> 2  0.0033703376 0.0010049932
157
158
#> 3  0.0016672084 0.0006434285
158
Then we can run the GPROB function to estimate probabilities:
159
#> 4  0.0003951084 0.0007126714
159
160
#> 5  0.0008885550 0.0014465506
160
```{r}
161
#> 7  0.0005407850 0.0004337103
161
library(GPROB)
162
#> 8  0.0004622457 0.0006414499
162
res <- GPROB(prevalence, or, geno)
163
#> 10 0.0003200618 0.0013374018
163
res
164
#> 
164
```
165
#> $cond_prob
165
166
#>           RA       SLE
166
In this example, we might interpret the numbers as follows:
167
#> 1  0.1665326 0.8334674
167
168
#> 2  0.7703046 0.2296954
168
- Individual 2 has RA with probability 0.003, given individual genetic risk
169
#> 3  0.7215363 0.2784637
169
  factors, disease prevalence, and the number of patients used in genetic risk
170
#> 4  0.3566669 0.6433331
170
  score calculations.
171
#> 5  0.3805203 0.6194797
171
172
#> 7  0.5549386 0.4450614
172
- Individual 2 has RA with probability 0.77, conditional on the additional
173
#> 8  0.4188163 0.5811837
173
  assumption that individual 2 has either RA or SLE.
174
#> 10 0.1931034 0.8068966
174
175
```
175
## Calculations, step by step
176
176
177
In this example, we might interpret the numbers as follows:
177
Let's go through each step of GPROB to understand how how it works.
178
178
179
-   Individual 2 has RA with probability 0.003, given individual genetic
179
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as:
180
    risk factors, disease prevalence, and the number of patients used in
180
181
    genetic risk score calculations.
181
<p align="center">
182
182
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
183
-   Individual 2 has RA with probability 0.77, conditional on the
183
</p>
184
    additional assumption that individual 2 has either RA or SLE.
184
185
185
where:
186
## Calculations, step by step
186
187
187
- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i*
188
Let’s go through each step of GPROB to understand how how it works.
188
189
189
- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide
190
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for
190
  association study (GWAS) for disease *k*
191
disease *k* is defined as:
191
192
192
<table><tr><td>
193
<p align="center">
193
<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g.
194
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
194
0.5) to correct for possible overestimation of the effect sizes due to
195
</p>
195
publication bias. In other words, consider running <code>geno <- 0.5 *
196
196
geno</code>.
197
where:
197
</td></tr></table>
198
198
199
-   <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in
199
```{r}
200
    individual *i*
200
risk <- geno %*% log(or)
201
201
risk
202
-   <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in
202
```
203
    a genome-wide association study (GWAS) for disease *k*
203
204
204
The known prevalence <i>V<sub>k</sub></i> of each disease in the general population:
205
<table>
205
206
<tr>
206
```{r}
207
<td>
207
prevalence
208
<b>Note:</b> We might want to consider shrinking the risk by some factor
208
```
209
(e.g. 0.5) to correct for possible overestimation of the effect sizes
209
210
due to publication bias. In other words, consider running <code>geno \<-
210
We can calculate the population level probability <i>P<sub>ki</sub></i> that each individual
211
0.5 \* geno</code>.
211
has the disease.
212
</td>
212
213
</tr>
213
<p align="center">
214
</table>
214
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/>
215
215
</p>
216
``` r
216
217
risk <- geno %*% log(or)
217
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
218
risk
218
<i>(P&#773;<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the
219
#>            RA         SLE
219
mean probability <i>P&#773;<sub>k</sub></i> across individuals is equal to the
220
#> 1  -1.9379420  0.83464674
220
known prevalence <i>V<sub>k</sub></i> of the disease in the population.
221
#> 2   0.3140075  0.26236426
221
222
#> 3  -0.3915622 -0.18392284
222
```{r}
223
#> 4  -1.8325815 -0.08164399
223
# @param alpha A constant that we choose manually.
224
#> 5  -1.0216512  0.62700738
224
# @param risk A vector of risk scores for individuals.
225
#> 7  -1.5185740 -0.57856671
225
# @returns A vector of probabilities for each individual.
226
#> 8  -1.6755777 -0.18700450
226
prob <- function(alpha, risk) {
227
#> 10 -2.0433025  0.54844506
227
  1 / (
228
```
228
    1 + exp(alpha - risk)
229
229
  )
230
The known prevalence <i>V<sub>k</sub></i> of each disease in the general
230
}
231
population:
231
alpha <- sapply(seq(ncol(risk)), function(i) {
232
232
  o <- optimize(
233
``` r
233
    f        = function(alpha, risk, prevalence) {
234
prevalence
234
      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
235
#>    RA   SLE 
235
    },
236
#> 0.001 0.001
236
    interval = c(-100, 100),
237
```
237
    risk = risk[,i],
238
238
    prevalence = prevalence[i]
239
We can calculate the population level probability <i>P<sub>ki</sub></i>
239
  )
240
that each individual has the disease.
240
  o$minimum
241
241
})
242
<p align="center">
242
alpha
243
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/>
243
```
244
</p>
244
245
245
Now that we have computed alpha, we can compute the population-level
246
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
246
probabilities of disease for each individual.
247
<i>(P̅<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that
247
248
the mean probability <i>P̅<sub>k</sub></i> across individuals is equal to
248
```{r}
249
the known prevalence <i>V<sub>k</sub></i> of the disease in the
249
# population-level disease probability
250
population.
250
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
251
251
p
252
``` r
252
```
253
# @param alpha A constant that we choose manually.
253
254
# @param risk A vector of risk scores for individuals.
254
Next we assume that each individual has one of the diseases:
255
# @returns A vector of probabilities for each individual.
255
256
prob <- function(alpha, risk) {
256
<p align="center">
257
  1 / (
257
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
258
    1 + exp(alpha - risk)
258
</p>
259
  )
259
260
}
260
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each
261
alpha <- sapply(seq(ncol(risk)), function(i) {
261
disease *k*: 
262
  o <- optimize(
262
263
    f        = function(alpha, risk, prevalence) {
263
<p align="center">
264
      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
264
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
265
    },
265
</p>
266
    interval = c(-100, 100),
266
267
    risk = risk[,i],
267
```{r}
268
    prevalence = prevalence[i]
268
# patient-level disease probability
269
  )
269
cp <- p / rowSums(p)
270
  o$minimum
270
cp
271
})
271
```
272
alpha
272
273
#> [1] 6.003374 7.164133
274
```
275
276
Now that we have computed alpha, we can compute the population-level
277
probabilities of disease for each individual.
278
279
``` r
280
# population-level disease probability
281
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
282
p
283
#>            [,1]         [,2]
284
#> 1  0.0003556116 0.0017797758
285
#> 2  0.0033703376 0.0010049932
286
#> 3  0.0016672084 0.0006434285
287
#> 4  0.0003951084 0.0007126714
288
#> 5  0.0008885550 0.0014465506
289
#> 7  0.0005407850 0.0004337103
290
#> 8  0.0004622457 0.0006414499
291
#> 10 0.0003200618 0.0013374018
292
```
293
294
Next we assume that each individual has one of the diseases:
295
296
<p align="center">
297
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
298
</p>
299
300
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of
301
each disease *k*:
302
303
<p align="center">
304
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
305
</p>
306
307
``` r
308
# patient-level disease probability
309
cp <- p / rowSums(p)
310
cp
311
#>         [,1]      [,2]
312
#> 1  0.1665326 0.8334674
313
#> 2  0.7703046 0.2296954
314
#> 3  0.7215363 0.2784637
315
#> 4  0.3566669 0.6433331
316
#> 5  0.3805203 0.6194797
317
#> 7  0.5549386 0.4450614
318
#> 8  0.4188163 0.5811837
319
#> 10 0.1931034 0.8068966
320
```