|
a |
|
b/README.Rmd |
|
|
1 |
--- |
|
|
2 |
output: github_document |
|
|
3 |
--- |
|
|
4 |
|
|
|
5 |
```{r, include = FALSE} |
|
|
6 |
knitr::opts_chunk$set( |
|
|
7 |
collapse = TRUE, |
|
|
8 |
comment = "#>" |
|
|
9 |
) |
|
|
10 |
``` |
|
|
11 |
|
|
|
12 |
# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" /> |
|
|
13 |
|
|
|
14 |
Multiple diseases can present with similar initial symptoms, making it |
|
|
15 |
difficult to clinically differentiate between these conditions. GPROB uses |
|
|
16 |
patients' genetic information to help prioritize a diagnosis. This genetic |
|
|
17 |
diagnostic tool can be applied to any situation with phenotypically similar |
|
|
18 |
diseases with different underlying genetics. |
|
|
19 |
|
|
|
20 |
<!-- badges: start --> |
|
|
21 |
[](https://github.com/immunogenomics/GPROB/actions) |
|
|
22 |
<!-- badges: end --> |
|
|
23 |
|
|
|
24 |
|
|
|
25 |
## Citation |
|
|
26 |
|
|
|
27 |
Please cite: |
|
|
28 |
|
|
|
29 |
- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology |
|
|
30 |
outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020) |
|
|
31 |
|
|
|
32 |
[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548 |
|
|
33 |
|
|
|
34 |
|
|
|
35 |
## License |
|
|
36 |
|
|
|
37 |
Please see the [LICENSE] file for details. [Contact us] for other licensing |
|
|
38 |
options. |
|
|
39 |
|
|
|
40 |
[LICENSE]: LICENSE |
|
|
41 |
[Contact us]: mailto:soumya@broadinstitute.org |
|
|
42 |
|
|
|
43 |
|
|
|
44 |
## Installation |
|
|
45 |
|
|
|
46 |
Install and load the GPROB R package. |
|
|
47 |
|
|
|
48 |
```{r, eval = FALSE} |
|
|
49 |
devtools::install_github("immunogenomics/GPROB") |
|
|
50 |
library(GPROB) |
|
|
51 |
``` |
|
|
52 |
|
|
|
53 |
## Synopsis |
|
|
54 |
|
|
|
55 |
GPROB estimates the probability that each individual has a given phenotype. |
|
|
56 |
|
|
|
57 |
We need three inputs: |
|
|
58 |
|
|
|
59 |
- Population prevalences of the phenotypes of interest. |
|
|
60 |
|
|
|
61 |
- Odds ratios for SNP associations with the phenotypes. |
|
|
62 |
|
|
|
63 |
- SNP genotypes (0, 1, 2) for each individual. |
|
|
64 |
|
|
|
65 |
### Example |
|
|
66 |
|
|
|
67 |
Let's use a small example with artificial data to learn how to use GPROB. |
|
|
68 |
|
|
|
69 |
Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms |
|
|
70 |
(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus |
|
|
71 |
erythematosus (SLE). |
|
|
72 |
|
|
|
73 |
#### Prevalence |
|
|
74 |
|
|
|
75 |
First, we should find out the prevalence of RA and SLE in the population that |
|
|
76 |
is representative of our patients. |
|
|
77 |
|
|
|
78 |
```{r} |
|
|
79 |
prevalence <- c("RA" = 0.001, "SLE" = 0.001) |
|
|
80 |
``` |
|
|
81 |
|
|
|
82 |
#### Odds Ratios |
|
|
83 |
|
|
|
84 |
Next, we need to obtain the odds ratios (ORs) from published genome-wide |
|
|
85 |
association studies (GWAS). We should be careful to note which alleles are |
|
|
86 |
associated with the phenotype to compute the risk in the correct direction. |
|
|
87 |
|
|
|
88 |
```{r} |
|
|
89 |
or <- read.delim( |
|
|
90 |
sep = "", |
|
|
91 |
row.names = 1, |
|
|
92 |
text = " |
|
|
93 |
snp RA SLE |
|
|
94 |
SNP1 1.0 0.4 |
|
|
95 |
SNP2 1.0 0.9 |
|
|
96 |
SNP3 1.0 1.3 |
|
|
97 |
SNP4 0.4 1.6 |
|
|
98 |
SNP5 0.9 1.0 |
|
|
99 |
SNP6 1.3 1.0 |
|
|
100 |
SNP7 1.6 1.0 |
|
|
101 |
") |
|
|
102 |
or <- as.matrix(or) |
|
|
103 |
``` |
|
|
104 |
|
|
|
105 |
#### Genotypes |
|
|
106 |
|
|
|
107 |
Finally, we need the genotype data for each of our 10 patients. Here, the data |
|
|
108 |
is coded in the form (0, 1, 2) to indicate the number of copies of the risk |
|
|
109 |
allele. |
|
|
110 |
|
|
|
111 |
```{r} |
|
|
112 |
geno <- read.delim( |
|
|
113 |
sep = "", |
|
|
114 |
row.names = 1, |
|
|
115 |
text = " |
|
|
116 |
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 |
|
|
117 |
1 0 1 0 2 1 0 |
|
|
118 |
2 0 0 1 0 2 2 |
|
|
119 |
3 1 0 1 1 0 2 |
|
|
120 |
4 1 1 0 2 0 0 |
|
|
121 |
5 0 1 1 1 1 0 |
|
|
122 |
6 0 0 1 3 0 2 |
|
|
123 |
7 2 2 2 2 2 2 |
|
|
124 |
8 1 2 0 2 1 1 |
|
|
125 |
9 0 2 1 NA 1 2 |
|
|
126 |
10 1 0 2 2 2 0 |
|
|
127 |
") |
|
|
128 |
geno <- as.matrix(geno) |
|
|
129 |
``` |
|
|
130 |
|
|
|
131 |
#### Dealing with missing or invalid data |
|
|
132 |
|
|
|
133 |
Before we run the `GPROB()` function, we need to deal with invalid and missing |
|
|
134 |
data. |
|
|
135 |
|
|
|
136 |
We remove individuals who have `NA` for any SNP: |
|
|
137 |
|
|
|
138 |
```{r} |
|
|
139 |
ix <- apply(geno, 1, function(x) !any(is.na(x))) |
|
|
140 |
geno <- geno[ix,] |
|
|
141 |
``` |
|
|
142 |
|
|
|
143 |
We remove individuals who have invalid allele counts: |
|
|
144 |
|
|
|
145 |
```{r} |
|
|
146 |
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2)) |
|
|
147 |
geno <- geno[ix,] |
|
|
148 |
``` |
|
|
149 |
|
|
|
150 |
And we make sure that we use the same SNPs in the `or` and `geno` matrices: |
|
|
151 |
|
|
|
152 |
```{r} |
|
|
153 |
or <- or[colnames(geno),] |
|
|
154 |
``` |
|
|
155 |
|
|
|
156 |
#### Run GPROB |
|
|
157 |
|
|
|
158 |
Then we can run the GPROB function to estimate probabilities: |
|
|
159 |
|
|
|
160 |
```{r} |
|
|
161 |
library(GPROB) |
|
|
162 |
res <- GPROB(prevalence, or, geno) |
|
|
163 |
res |
|
|
164 |
``` |
|
|
165 |
|
|
|
166 |
In this example, we might interpret the numbers as follows: |
|
|
167 |
|
|
|
168 |
- Individual 2 has RA with probability 0.003, given individual genetic risk |
|
|
169 |
factors, disease prevalence, and the number of patients used in genetic risk |
|
|
170 |
score calculations. |
|
|
171 |
|
|
|
172 |
- Individual 2 has RA with probability 0.77, conditional on the additional |
|
|
173 |
assumption that individual 2 has either RA or SLE. |
|
|
174 |
|
|
|
175 |
## Calculations, step by step |
|
|
176 |
|
|
|
177 |
Let's go through each step of GPROB to understand how how it works. |
|
|
178 |
|
|
|
179 |
The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as: |
|
|
180 |
|
|
|
181 |
<p align="center"> |
|
|
182 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/> |
|
|
183 |
</p> |
|
|
184 |
|
|
|
185 |
where: |
|
|
186 |
|
|
|
187 |
- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i* |
|
|
188 |
|
|
|
189 |
- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide |
|
|
190 |
association study (GWAS) for disease *k* |
|
|
191 |
|
|
|
192 |
<table><tr><td> |
|
|
193 |
<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g. |
|
|
194 |
0.5) to correct for possible overestimation of the effect sizes due to |
|
|
195 |
publication bias. In other words, consider running <code>geno <- 0.5 * |
|
|
196 |
geno</code>. |
|
|
197 |
</td></tr></table> |
|
|
198 |
|
|
|
199 |
```{r} |
|
|
200 |
risk <- geno %*% log(or) |
|
|
201 |
risk |
|
|
202 |
``` |
|
|
203 |
|
|
|
204 |
The known prevalence <i>V<sub>k</sub></i> of each disease in the general population: |
|
|
205 |
|
|
|
206 |
```{r} |
|
|
207 |
prevalence |
|
|
208 |
``` |
|
|
209 |
|
|
|
210 |
We can calculate the population level probability <i>P<sub>ki</sub></i> that each individual |
|
|
211 |
has the disease. |
|
|
212 |
|
|
|
213 |
<p align="center"> |
|
|
214 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/> |
|
|
215 |
</p> |
|
|
216 |
|
|
|
217 |
We find <i>α<sub>k</sub></i> for each disease *k* by minimizing |
|
|
218 |
<i>(P̅<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the |
|
|
219 |
mean probability <i>P̅<sub>k</sub></i> across individuals is equal to the |
|
|
220 |
known prevalence <i>V<sub>k</sub></i> of the disease in the population. |
|
|
221 |
|
|
|
222 |
```{r} |
|
|
223 |
# @param alpha A constant that we choose manually. |
|
|
224 |
# @param risk A vector of risk scores for individuals. |
|
|
225 |
# @returns A vector of probabilities for each individual. |
|
|
226 |
prob <- function(alpha, risk) { |
|
|
227 |
1 / ( |
|
|
228 |
1 + exp(alpha - risk) |
|
|
229 |
) |
|
|
230 |
} |
|
|
231 |
alpha <- sapply(seq(ncol(risk)), function(i) { |
|
|
232 |
o <- optimize( |
|
|
233 |
f = function(alpha, risk, prevalence) { |
|
|
234 |
( mean(prob(alpha, risk)) - prevalence ) ^ 2 |
|
|
235 |
}, |
|
|
236 |
interval = c(-100, 100), |
|
|
237 |
risk = risk[,i], |
|
|
238 |
prevalence = prevalence[i] |
|
|
239 |
) |
|
|
240 |
o$minimum |
|
|
241 |
}) |
|
|
242 |
alpha |
|
|
243 |
``` |
|
|
244 |
|
|
|
245 |
Now that we have computed alpha, we can compute the population-level |
|
|
246 |
probabilities of disease for each individual. |
|
|
247 |
|
|
|
248 |
```{r} |
|
|
249 |
# population-level disease probability |
|
|
250 |
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i])) |
|
|
251 |
p |
|
|
252 |
``` |
|
|
253 |
|
|
|
254 |
Next we assume that each individual has one of the diseases: |
|
|
255 |
|
|
|
256 |
<p align="center"> |
|
|
257 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/> |
|
|
258 |
</p> |
|
|
259 |
|
|
|
260 |
Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each |
|
|
261 |
disease *k*: |
|
|
262 |
|
|
|
263 |
<p align="center"> |
|
|
264 |
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/> |
|
|
265 |
</p> |
|
|
266 |
|
|
|
267 |
```{r} |
|
|
268 |
# patient-level disease probability |
|
|
269 |
cp <- p / rowSums(p) |
|
|
270 |
cp |
|
|
271 |
``` |
|
|
272 |
|