[4ed68b]: / README.Rmd

Download this file

273 lines (200 with data), 6.8 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# GPROB <img src="man/figures/gprob.svg" width="181px" align="right" />

Multiple diseases can present with similar initial symptoms, making it
difficult to clinically differentiate between these conditions. GPROB uses
patients' genetic information to help prioritize a diagnosis. This genetic
diagnostic tool can be applied to any situation with phenotypically similar
diseases with different underlying genetics.

<!-- badges: start -->
[![R build status](https://github.com/immunogenomics/GPROB/workflows/R-CMD-check/badge.svg)](https://github.com/immunogenomics/GPROB/actions)
<!-- badges: end -->


## Citation

Please cite:

- Knevel, R. et al. [Using genetics to prioritize diagnoses for rheumatology
  outpatients with inflammatory arthritis.][1] Sci. Transl. Med. 12, (2020)

[1]: http://dx.doi.org/10.1126/scitranslmed.aay1548


## License

Please see the [LICENSE] file for details. [Contact us] for other licensing
options.

[LICENSE]: LICENSE
[Contact us]: mailto:soumya@broadinstitute.org


## Installation

Install and load the GPROB R package.

```{r, eval = FALSE}
devtools::install_github("immunogenomics/GPROB")
library(GPROB)
```

## Synopsis

GPROB estimates the probability that each individual has a given phenotype.

We need three inputs:

- Population prevalences of the phenotypes of interest.

- Odds ratios for SNP associations with the phenotypes.

- SNP genotypes (0, 1, 2) for each individual.

### Example

Let's use a small example with artificial data to learn how to use GPROB.

Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms
(SNPs) associated with rheumatoid arthritis (RA) or systemic lupus
erythematosus (SLE).

#### Prevalence

First, we should find out the prevalence of RA and SLE in the population that
is representative of our patients.

```{r}
prevalence <- c("RA" = 0.001, "SLE" = 0.001)
```

#### Odds Ratios

Next, we need to obtain the odds ratios (ORs) from published genome-wide
association studies (GWAS). We should be careful to note which alleles are
associated with the phenotype to compute the risk in the correct direction.

```{r}
or <- read.delim(
  sep = "",
  row.names = 1,
  text = "
snp  RA SLE
SNP1 1.0 0.4
SNP2 1.0 0.9
SNP3 1.0 1.3
SNP4 0.4 1.6
SNP5 0.9 1.0
SNP6 1.3 1.0
SNP7 1.6 1.0
")
or <- as.matrix(or)
```

#### Genotypes

Finally, we need the genotype data for each of our 10 patients. Here, the data
is coded in the form (0, 1, 2) to indicate the number of copies of the risk
allele.

```{r}
geno <- read.delim(
  sep = "",
  row.names = 1,
  text = "
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
 1    0    1    0    2    1    0
 2    0    0    1    0    2    2
 3    1    0    1    1    0    2
 4    1    1    0    2    0    0
 5    0    1    1    1    1    0
 6    0    0    1    3    0    2
 7    2    2    2    2    2    2
 8    1    2    0    2    1    1
 9    0    2    1   NA    1    2
10    1    0    2    2    2    0
")
geno <- as.matrix(geno)
```

#### Dealing with missing or invalid data

Before we run the `GPROB()` function, we need to deal with invalid and missing
data.

We remove individuals who have `NA` for any SNP:

```{r}
ix <- apply(geno, 1, function(x) !any(is.na(x)))
geno <- geno[ix,]
```

We remove individuals who have invalid allele counts:

```{r}
ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
geno <- geno[ix,]
```

And we make sure that we use the same SNPs in the `or` and `geno` matrices:

```{r}
or <- or[colnames(geno),]
```

#### Run GPROB

Then we can run the GPROB function to estimate probabilities:

```{r}
library(GPROB)
res <- GPROB(prevalence, or, geno)
res
```

In this example, we might interpret the numbers as follows:

- Individual 2 has RA with probability 0.003, given individual genetic risk
  factors, disease prevalence, and the number of patients used in genetic risk
  score calculations.

- Individual 2 has RA with probability 0.77, conditional on the additional
  assumption that individual 2 has either RA or SLE.

## Calculations, step by step

Let's go through each step of GPROB to understand how how it works.

The genetic risk score <i>S<sub>ki</sub></i> of individual *i* for disease *k* is defined as:

<p align="center">
<img src="https://latex.codecogs.com/svg.latex?\Large&space;S_{ki}=\sum_{j}{\beta_{kj}x_{ij}}"/>
</p>

where:

- <i>x<sub>ij</sub></i> is the number of risk alleles of SNP *j* in individual *i*

- <i>β<sub>kj</sub></i> is the log odds ratio for SNP *j* reported in a genome-wide
  association study (GWAS) for disease *k*

<table><tr><td>
<b>Note:</b> We might want to consider shrinking the risk by some factor (e.g.
0.5) to correct for possible overestimation of the effect sizes due to
publication bias. In other words, consider running <code>geno <- 0.5 *
geno</code>.
</td></tr></table>

```{r}
risk <- geno %*% log(or)
risk
```

The known prevalence <i>V<sub>k</sub></i> of each disease in the general population:

```{r}
prevalence
```

We can calculate the population level probability <i>P<sub>ki</sub></i> that each individual
has the disease.

<p align="center">
<img src="https://latex.codecogs.com/svg.latex?\Large&space;P_{ki}=\frac{1}{1+\exp{(S_{ki}-\alpha_k)}}"/>
</p>

We find <i>α<sub>k</sub></i> for each disease *k* by minimizing
<i>(P&#773;<sub>k</sub> - V<sub>k</sub>)<sup>2</sup></i>. This ensures that the
mean probability <i>P&#773;<sub>k</sub></i> across individuals is equal to the
known prevalence <i>V<sub>k</sub></i> of the disease in the population.

```{r}
# @param alpha A constant that we choose manually.
# @param risk A vector of risk scores for individuals.
# @returns A vector of probabilities for each individual.
prob <- function(alpha, risk) {
  1 / (
    1 + exp(alpha - risk)
  )
}
alpha <- sapply(seq(ncol(risk)), function(i) {
  o <- optimize(
    f        = function(alpha, risk, prevalence) {
      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
    },
    interval = c(-100, 100),
    risk = risk[,i],
    prevalence = prevalence[i]
  )
  o$minimum
})
alpha
```

Now that we have computed alpha, we can compute the population-level
probabilities of disease for each individual.

```{r}
# population-level disease probability
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
p
```

Next we assume that each individual has one of the diseases:

<p align="center">
<img src="https://latex.codecogs.com/svg.latex?\Large&space;\text{Pr}(Y_k=1|(\textstyle\sum_k{Y_k})=1)"/>
</p>

Then, we calculate the conditional probability <i>C<sub>ki</sub></i> of each
disease *k*: 

<p align="center">
<img src="https://latex.codecogs.com/svg.latex?\Large&space;C_{ki}=\frac{P_{ki}}{\sum_k{P_{ki}}}"/>
</p>

```{r}
# patient-level disease probability
cp <- p / rowSums(p)
cp
```