|
a/README.md |
|
b/README.md |
1 |
|
1 |
--- |
2 |
# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" /> |
2 |
output: github_document |
3 |
|
3 |
--- |
4 |
Multiple diseases can present with similar initial symptoms, making it |
4 |
|
5 |
difficult to clinically differentiate between these conditions. GPROB |
5 |
```{r, include = FALSE} |
6 |
uses patients’ genetic information to help prioritize a diagnosis. This |
6 |
knitr::opts_chunk$set( |
7 |
genetic diagnostic tool can be applied to any situation with |
7 |
collapse = TRUE, |
8 |
phenotypically similar diseases with different underlying genetics. |
8 |
comment = "#>" |
9 |
|
9 |
) |
10 |
<!-- badges: start --> |
10 |
``` |
11 |
|
11 |
|
12 |
[](https://github.com/immunogenomics/GPROB/actions) |
13 |
|
14 |
<!-- badges: end --> |
14 |
Multiple diseases can present with similar initial symptoms, making it |
15 |
|
15 |
difficult to clinically differentiate between these conditions. GPROB uses |
16 |
## Citation |
16 |
patients' genetic information to help prioritize a diagnosis. This genetic |
17 |
|
17 |
diagnostic tool can be applied to any situation with phenotypically similar |
18 |
Please cite: |
18 |
diseases with different underlying genetics. |
19 |
|
19 |
|
20 |
- Knevel, R. et al. [Using genetics to prioritize diagnoses for |
20 |
<!-- badges: start --> |
21 |
rheumatology outpatients with inflammatory |
21 |
[](https://github.com/immunogenomics/GPROB/actions) |
22 |
arthritis.](http://dx.doi.org/10.1126/scitranslmed.aay1548) Sci. |
22 |
<!-- badges: end --> |
23 |
Transl. Med. 12, (2020) |
23 |
|
24 |
|
24 |
|
25 |
## License |
25 |
## Citation |
26 |
|
26 |
|
27 |
Please see the [LICENSE](LICENSE) file for details. [Contact |
27 |
Please cite: |
28 |
us](mailto:soumya@broadinstitute.org) for other licensing options. |
28 |
|
29 |
|
29 |
- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology |
30 |
## Installation |
30 |
outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020) |
31 |
|
31 |
|
32 |
Install and load the GPROB R package. |
32 |
[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548 |
33 |
|
33 |
|
34 |
``` r |
34 |
|
35 |
devtools::install_github("immunogenomics/GPROB") |
35 |
## License |
36 |
library(GPROB) |
36 |
|
37 |
``` |
37 |
Please see the [LICENSE] file for details. [Contact us] for other licensing |
38 |
|
38 |
options. |
39 |
## Synopsis |
39 |
|
40 |
|
40 |
[LICENSE]: LICENSE |
41 |
GPROB estimates the probability that each individual has a given |
41 |
[Contact us]: mailto:soumya@broadinstitute.org |
42 |
phenotype. |
42 |
|
43 |
|
43 |
|
44 |
We need three inputs: |
44 |
## Installation |
45 |
|
45 |
|
46 |
- Population prevalences of the phenotypes of interest. |
46 |
Install and load the GPROB R package. |
47 |
|
47 |
|
48 |
- Odds ratios for SNP associations with the phenotypes. |
48 |
```{r, eval = FALSE} |
49 |
|
49 |
devtools::install_github("immunogenomics/GPROB") |
50 |
- SNP genotypes (0, 1, 2) for each individual. |
50 |
library(GPROB) |
51 |
|
51 |
``` |
52 |
### Example |
52 |
|
53 |
|
53 |
## Synopsis |
54 |
Let’s use a small example with artificial data to learn how to use |
54 |
|
55 |
GPROB. |
55 |
GPROB estimates the probability that each individual has a given phenotype. |
56 |
|
56 |
|
57 |
Suppose we have 10 patients, and we know of 7 single nucleotide |
57 |
We need three inputs: |
58 |
polymorphisms (SNPs) associated with rheumatoid arthritis (RA) or |
58 |
|
59 |
systemic lupus erythematosus (SLE). |
59 |
- Population prevalences of the phenotypes of interest. |
60 |
|
60 |
|
61 |
#### Prevalence |
61 |
- Odds ratios for SNP associations with the phenotypes. |
62 |
|
62 |
|
63 |
First, we should find out the prevalence of RA and SLE in the population |
63 |
- SNP genotypes (0, 1, 2) for each individual. |
64 |
that is representative of our patients. |
64 |
|
65 |
|
65 |
### Example |
66 |
``` r |
66 |
|
67 |
prevalence <- c("RA" = 0.001, "SLE" = 0.001) |
67 |
Let's use a small example with artificial data to learn how to use GPROB. |
68 |
``` |
68 |
|
69 |
|
69 |
Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms |
70 |
#### Odds Ratios |
70 |
(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus |
71 |
|
71 |
erythematosus (SLE). |
72 |
Next, we need to obtain the odds ratios (ORs) from published genome-wide |
72 |
|
73 |
association studies (GWAS). We should be careful to note which alleles |
73 |
#### Prevalence |
74 |
are associated with the phenotype to compute the risk in the correct |
74 |
|
75 |
direction. |
75 |
First, we should find out the prevalence of RA and SLE in the population that |
76 |
|
76 |
is representative of our patients. |
77 |
``` r |
77 |
|
78 |
or <- read.delim( |
78 |
```{r} |
79 |
sep = "", |
79 |
prevalence <- c("RA" = 0.001, "SLE" = 0.001) |
80 |
row.names = 1, |
80 |
``` |
81 |
text = " |
81 |
|
82 |
snp RA SLE |
82 |
#### Odds Ratios |
83 |
SNP1 1.0 0.4 |
83 |
|
84 |
SNP2 1.0 0.9 |
84 |
Next, we need to obtain the odds ratios (ORs) from published genome-wide |
85 |
SNP3 1.0 1.3 |
85 |
association studies (GWAS). We should be careful to note which alleles are |
86 |
SNP4 0.4 1.6 |
86 |
associated with the phenotype to compute the risk in the correct direction. |
87 |
SNP5 0.9 1.0 |
87 |
|
88 |
SNP6 1.3 1.0 |
88 |
```{r} |
89 |
SNP7 1.6 1.0 |
89 |
or <- read.delim( |
90 |
") |
90 |
sep = "", |
91 |
or <- as.matrix(or) |
91 |
row.names = 1, |
92 |
``` |
92 |
text = " |
93 |
|
93 |
snp RA SLE |
94 |
#### Genotypes |
94 |
SNP1 1.0 0.4 |
95 |
|
95 |
SNP2 1.0 0.9 |
96 |
Finally, we need the genotype data for each of our 10 patients. Here, |
96 |
SNP3 1.0 1.3 |
97 |
the data is coded in the form (0, 1, 2) to indicate the number of copies |
97 |
SNP4 0.4 1.6 |
98 |
of the risk allele. |
98 |
SNP5 0.9 1.0 |
99 |
|
99 |
SNP6 1.3 1.0 |
100 |
``` r |
100 |
SNP7 1.6 1.0 |
101 |
geno <- read.delim( |
101 |
") |
102 |
sep = "", |
102 |
or <- as.matrix(or) |
103 |
row.names = 1, |
103 |
``` |
104 |
text = " |
104 |
|
105 |
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 |
105 |
#### Genotypes |
106 |
1 0 1 0 2 1 0 |
106 |
|
107 |
2 0 0 1 0 2 2 |
107 |
Finally, we need the genotype data for each of our 10 patients. Here, the data |
108 |
3 1 0 1 1 0 2 |
108 |
is coded in the form (0, 1, 2) to indicate the number of copies of the risk |
109 |
4 1 1 0 2 0 0 |
109 |
allele. |
110 |
5 0 1 1 1 1 0 |
110 |
|
111 |
6 0 0 1 3 0 2 |
111 |
```{r} |
112 |
7 2 2 2 2 2 2 |
112 |
geno <- read.delim( |
113 |
8 1 2 0 2 1 1 |
113 |
sep = "", |
114 |
9 0 2 1 NA 1 2 |
114 |
row.names = 1, |
115 |
10 1 0 2 2 2 0 |
115 |
text = " |
116 |
") |
116 |
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 |
117 |
geno <- as.matrix(geno) |
117 |
1 0 1 0 2 1 0 |
118 |
``` |
118 |
2 0 0 1 0 2 2 |
119 |
|
119 |
3 1 0 1 1 0 2 |
120 |
#### Dealing with missing or invalid data |
120 |
4 1 1 0 2 0 0 |
121 |
|
121 |
5 0 1 1 1 1 0 |
122 |
Before we run the `GPROB()` function, we need to deal with invalid and |
122 |
6 0 0 1 3 0 2 |
123 |
missing data. |
123 |
7 2 2 2 2 2 2 |
124 |
|
124 |
8 1 2 0 2 1 1 |
125 |
We remove individuals who have `NA` for any SNP: |
125 |
9 0 2 1 NA 1 2 |
126 |
|
126 |
10 1 0 2 2 2 0 |
127 |
``` r |
127 |
") |
128 |
ix <- apply(geno, 1, function(x) !any(is.na(x))) |
128 |
geno <- as.matrix(geno) |
129 |
geno <- geno[ix,] |
129 |
``` |
130 |
``` |
130 |
|
131 |
|
131 |
#### Dealing with missing or invalid data |
132 |
We remove individuals who have invalid allele counts: |
132 |
|
133 |
|
133 |
Before we run the `GPROB()` function, we need to deal with invalid and missing |
134 |
``` r |
134 |
data. |
135 |
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2)) |
135 |
|
136 |
geno <- geno[ix,] |
136 |
We remove individuals who have `NA` for any SNP: |
137 |
``` |
137 |
|
138 |
|
138 |
```{r} |
139 |
And we make sure that we use the same SNPs in the `or` and `geno` |
139 |
ix <- apply(geno, 1, function(x) !any(is.na(x))) |
140 |
matrices: |
140 |
geno <- geno[ix,] |
141 |
|
141 |
``` |
142 |
``` r |
142 |
|
143 |
or <- or[colnames(geno),] |
143 |
We remove individuals who have invalid allele counts: |
144 |
``` |
144 |
|
145 |
|
145 |
```{r} |
146 |
#### Run GPROB |
146 |
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2)) |
147 |
|
147 |
geno <- geno[ix,] |
148 |
Then we can run the GPROB function to estimate probabilities: |
148 |
``` |
149 |
|
149 |
|
150 |
``` r |
150 |
And we make sure that we use the same SNPs in the `or` and `geno` matrices: |
151 |
library(GPROB) |
151 |
|
152 |
res <- GPROB(prevalence, or, geno) |
152 |
```{r} |
153 |
res |
153 |
or <- or[colnames(geno),] |
154 |
#> $pop_prob |
154 |
``` |
155 |
#> RA SLE |
155 |
|
156 |
#> 1 0.0003556116 0.0017797758 |
156 |
#### Run GPROB |
157 |
#> 2 0.0033703376 0.0010049932 |
157 |
|
158 |
#> 3 0.0016672084 0.0006434285 |
158 |
Then we can run the GPROB function to estimate probabilities: |
159 |
#> 4 0.0003951084 0.0007126714 |
159 |
|
160 |
#> 5 0.0008885550 0.0014465506 |
160 |
```{r} |
161 |
#> 7 0.0005407850 0.0004337103 |
161 |
library(GPROB) |
162 |
#> 8 0.0004622457 0.0006414499 |
162 |
res <- GPROB(prevalence, or, geno) |
163 |
#> 10 0.0003200618 0.0013374018 |
163 |
res |
164 |
#> |
164 |
``` |
165 |
#> $cond_prob |
165 |
|
166 |
#> RA SLE |
166 |
In this example, we might interpret the numbers as follows: |
167 |
#> 1 0.1665326 0.8334674 |
167 |
|
168 |
#> 2 0.7703046 0.2296954 |
168 |
- Individual 2 has RA with probability 0.003, given individual genetic risk |
169 |
#> 3 0.7215363 0.2784637 |
169 |
factors, disease prevalence, and the number of patients used in genetic risk |
170 |
#> 4 0.3566669 0.6433331 |
170 |
score calculations. |
171 |
#> 5 0.3805203 0.6194797 |
171 |
|
172 |
#> 7 0.5549386 0.4450614 |
172 |
- Individual 2 has RA with probability 0.77, conditional on the additional |
173 |
#> 8 0.4188163 0.5811837 |
173 |
assumption that individual 2 has either RA or SLE. |
174 |
#> 10 0.1931034 0.8068966 |
174 |
|
175 |
``` |
175 |
## Calculations, step by step |
176 |
|
176 |
|
177 |
In this example, we might interpret the numbers as follows: |
177 |
Let's go through each step of GPROB to understand how how it works. |
178 |
|
178 |
|
179 |
- Individual 2 has RA with probability 0.003, given individual genetic |
179 |
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as: |
180 |
risk factors, disease prevalence, and the number of patients used in |
180 |
|
181 |
genetic risk score calculations. |
181 |
<p align="center"> |
182 |
|
182 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/> |
183 |
- Individual 2 has RA with probability 0.77, conditional on the |
183 |
</p> |
184 |
additional assumption that individual 2 has either RA or SLE. |
184 |
|
185 |
|
185 |
where: |
186 |
## Calculations, step by step |
186 |
|
187 |
|
187 |
- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i* |
188 |
Let’s go through each step of GPROB to understand how how it works. |
188 |
|
189 |
|
189 |
- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide |
190 |
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for |
190 |
association study (GWAS) for disease *k* |
191 |
disease *k* is defined as: |
191 |
|
192 |
|
192 |
<table><tr><td> |
193 |
<p align="center"> |
193 |
<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g. |
194 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/> |
194 |
0.5) to correct for possible overestimation of the effect sizes due to |
195 |
</p> |
195 |
publication bias. In other words, consider running <code>geno <- 0.5 * |
196 |
|
196 |
geno</code>. |
197 |
where: |
197 |
</td></tr></table> |
198 |
|
198 |
|
199 |
- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in |
199 |
```{r} |
200 |
individual *i* |
200 |
risk <- geno %*% log(or) |
201 |
|
201 |
risk |
202 |
- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in |
202 |
``` |
203 |
a genome-wide association study (GWAS) for disease *k* |
203 |
|
204 |
|
204 |
The known prevalence <i>V<sub>k</sub></i> of each disease in the general population: |
205 |
<table> |
205 |
|
206 |
<tr> |
206 |
```{r} |
207 |
<td> |
207 |
prevalence |
208 |
<b>Note:</b> We might want to consider shrinking the risk by some factor |
208 |
``` |
209 |
(e.g. 0.5) to correct for possible overestimation of the effect sizes |
209 |
|
210 |
due to publication bias. In other words, consider running <code>geno \<- |
210 |
We can calculate the population level probability <i>P<sub>ki</sub></i> that each individual |
211 |
0.5 \* geno</code>. |
211 |
has the disease. |
212 |
</td> |
212 |
|
213 |
</tr> |
213 |
<p align="center"> |
214 |
</table> |
214 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/> |
215 |
|
215 |
</p> |
216 |
``` r |
216 |
|
217 |
risk <- geno %*% log(or) |
217 |
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing |
218 |
risk |
218 |
<i>(P̅<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the |
219 |
#> RA SLE |
219 |
mean probability <i>P̅<sub>k</sub></i> across individuals is equal to the |
220 |
#> 1 -1.9379420 0.83464674 |
220 |
known prevalence <i>V<sub>k</sub></i> of the disease in the population. |
221 |
#> 2 0.3140075 0.26236426 |
221 |
|
222 |
#> 3 -0.3915622 -0.18392284 |
222 |
```{r} |
223 |
#> 4 -1.8325815 -0.08164399 |
223 |
# @param alpha A constant that we choose manually. |
224 |
#> 5 -1.0216512 0.62700738 |
224 |
# @param risk A vector of risk scores for individuals. |
225 |
#> 7 -1.5185740 -0.57856671 |
225 |
# @returns A vector of probabilities for each individual. |
226 |
#> 8 -1.6755777 -0.18700450 |
226 |
prob <- function(alpha, risk) { |
227 |
#> 10 -2.0433025 0.54844506 |
227 |
1 / ( |
228 |
``` |
228 |
1 + exp(alpha - risk) |
229 |
|
229 |
) |
230 |
The known prevalence <i>V<sub>k</sub></i> of each disease in the general |
230 |
} |
231 |
population: |
231 |
alpha <- sapply(seq(ncol(risk)), function(i) { |
232 |
|
232 |
o <- optimize( |
233 |
``` r |
233 |
f = function(alpha, risk, prevalence) { |
234 |
prevalence |
234 |
( mean(prob(alpha, risk)) - prevalence ) ^ 2 |
235 |
#> RA SLE |
235 |
}, |
236 |
#> 0.001 0.001 |
236 |
interval = c(-100, 100), |
237 |
``` |
237 |
risk = risk[,i], |
238 |
|
238 |
prevalence = prevalence[i] |
239 |
We can calculate the population level probability <i>P<sub>ki</sub></i> |
239 |
) |
240 |
that each individual has the disease. |
240 |
o$minimum |
241 |
|
241 |
}) |
242 |
<p align="center"> |
242 |
alpha |
243 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/> |
243 |
``` |
244 |
</p> |
244 |
|
245 |
|
245 |
Now that we have computed alpha, we can compute the population-level |
246 |
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing |
246 |
probabilities of disease for each individual. |
247 |
<i>(P̅<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that |
247 |
|
248 |
the mean probability <i>P̅<sub>k</sub></i> across individuals is equal to |
248 |
```{r} |
249 |
the known prevalence <i>V<sub>k</sub></i> of the disease in the |
249 |
# population-level disease probability |
250 |
population. |
250 |
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i])) |
251 |
|
251 |
p |
252 |
``` r |
252 |
``` |
253 |
# @param alpha A constant that we choose manually. |
253 |
|
254 |
# @param risk A vector of risk scores for individuals. |
254 |
Next we assume that each individual has one of the diseases: |
255 |
# @returns A vector of probabilities for each individual. |
255 |
|
256 |
prob <- function(alpha, risk) { |
256 |
<p align="center"> |
257 |
1 / ( |
257 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/> |
258 |
1 + exp(alpha - risk) |
258 |
</p> |
259 |
) |
259 |
|
260 |
} |
260 |
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each |
261 |
alpha <- sapply(seq(ncol(risk)), function(i) { |
261 |
disease *k*: |
262 |
o <- optimize( |
262 |
|
263 |
f = function(alpha, risk, prevalence) { |
263 |
<p align="center"> |
264 |
( mean(prob(alpha, risk)) - prevalence ) ^ 2 |
264 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/> |
265 |
}, |
265 |
</p> |
266 |
interval = c(-100, 100), |
266 |
|
267 |
risk = risk[,i], |
267 |
```{r} |
268 |
prevalence = prevalence[i] |
268 |
# patient-level disease probability |
269 |
) |
269 |
cp <- p / rowSums(p) |
270 |
o$minimum |
270 |
cp |
271 |
}) |
271 |
``` |
272 |
alpha |
272 |
|
273 |
#> [1] 6.003374 7.164133 |
|
|
274 |
``` |
|
|
275 |
|
|
|
276 |
Now that we have computed alpha, we can compute the population-level |
|
|
277 |
probabilities of disease for each individual. |
|
|
278 |
|
|
|
279 |
``` r |
|
|
280 |
# population-level disease probability |
|
|
281 |
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i])) |
|
|
282 |
p |
|
|
283 |
#> [,1] [,2] |
|
|
284 |
#> 1 0.0003556116 0.0017797758 |
|
|
285 |
#> 2 0.0033703376 0.0010049932 |
|
|
286 |
#> 3 0.0016672084 0.0006434285 |
|
|
287 |
#> 4 0.0003951084 0.0007126714 |
|
|
288 |
#> 5 0.0008885550 0.0014465506 |
|
|
289 |
#> 7 0.0005407850 0.0004337103 |
|
|
290 |
#> 8 0.0004622457 0.0006414499 |
|
|
291 |
#> 10 0.0003200618 0.0013374018 |
|
|
292 |
``` |
|
|
293 |
|
|
|
294 |
Next we assume that each individual has one of the diseases: |
|
|
295 |
|
|
|
296 |
<p align="center"> |
|
|
297 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/> |
|
|
298 |
</p> |
|
|
299 |
|
|
|
300 |
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of |
|
|
301 |
each disease *k*: |
|
|
302 |
|
|
|
303 |
<p align="center"> |
|
|
304 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/> |
|
|
305 |
</p> |
|
|
306 |
|
|
|
307 |
``` r |
|
|
308 |
# patient-level disease probability |
|
|
309 |
cp <- p / rowSums(p) |
|
|
310 |
cp |
|
|
311 |
#> [,1] [,2] |
|
|
312 |
#> 1 0.1665326 0.8334674 |
|
|
313 |
#> 2 0.7703046 0.2296954 |
|
|
314 |
#> 3 0.7215363 0.2784637 |
|
|
315 |
#> 4 0.3566669 0.6433331 |
|
|
316 |
#> 5 0.3805203 0.6194797 |
|
|
317 |
#> 7 0.5549386 0.4450614 |
|
|
318 |
#> 8 0.4188163 0.5811837 |
|
|
319 |
#> 10 0.1931034 0.8068966 |
|
|
320 |
``` |
|
|