# Prepare your own patient dataset

## Preprocessing steps

Your preprocessing script must do the following:

1. Map genes to Ensembl IDs
2. Map phenotypes to the 2019 version of HPO
3. Output a JSON Lines file in which each line is a JSON object containing the information for a single patient

Please refer to `create_mygene2_cohort/preprocess_mygene2.py` for an example preprocessing script.
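
The three steps above can be sketched as follows. This is a minimal illustration and not the logic of `preprocess_mygene2.py`: the lookup tables `GENE_TO_ENSEMBL` and `PHENOTYPE_TO_HPO`, their placeholder keys, the layout of the raw records, and the helper names are all hypothetical. A real script would build its mappings from Ensembl and from the 2019 HPO release.

```python
import json

# Hypothetical lookup tables with placeholder keys. The IDs are borrowed from
# the example patient in this README purely to have well-formed values.
GENE_TO_ENSEMBL = {
    "PLACEHOLDER_GENE_A": "ENSG00000069431",
    "PLACEHOLDER_GENE_B": "ENSG00000196277",
}
PHENOTYPE_TO_HPO = {
    "placeholder phenotype A": "HP:0001249",
    "placeholder phenotype B": "HP:0000924",
}


def to_patient_record(raw):
    """Convert one raw record (hypothetical layout) to the output schema.

    Raises KeyError if a gene or phenotype has no entry in the lookup tables.
    """
    return {
        "id": raw["id"],
        "positive_phenotypes": [PHENOTYPE_TO_HPO[p] for p in raw["phenotypes"]],
        "all_candidate_genes": [GENE_TO_ENSEMBL[g] for g in raw["candidate_genes"]],
    }


def to_jsonlines(raw_records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(to_patient_record(r)) + "\n" for r in raw_records)


def write_jsonlines(raw_records, path):
    """Write the JSON Lines text for all records to the given path."""
    with open(path, "w") as f:
        f.write(to_jsonlines(raw_records))
```

Failing loudly on an unmapped gene or phenotype is a deliberate choice in this sketch; a real pipeline might instead log and skip such entries.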

## Patient information

An example patient from the simulated patients dataset:

```
{
    "id": 9,
    "positive_phenotypes": ["HP:0000221", "HP:0000232", "HP:0001155", "HP:0005692", "HP:0012471", "HP:0100540", "HP:0001999", "HP:0001249", "HP:0010285", "HP:0000924", "HP:0004459"],
    "all_candidate_genes": ["ENSG00000196277", "ENSG00000104899", "ENSG00000143156", "ENSG00000088451", "ENSG00000157557", "ENSG00000165125", "ENSG00000157766", "ENSG00000108821", "ENSG00000142655", "ENSG00000184470", "ENSG00000157119", "ENSG00000069431", "ENSG00000131828", "ENSG00000179111", "ENSG00000168646"],
    "true_genes": ["ENSG00000069431"],
    "true_diseases": ["966"]
}
```
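
As a quick sanity check, the identifiers in a record like the one above can be tested against their usual shapes: HPO terms are written as `HP:` followed by seven digits, and human Ensembl gene IDs as `ENSG` followed by eleven digits. The sketch below assumes those two patterns. The function name `malformed_ids` is hypothetical and not part of this repository.

```python
import re

# Assumed ID shapes: "HP:" + 7 digits, and "ENSG" + 11 digits.
HPO_PATTERN = re.compile(r"HP:\d{7}")
ENSEMBL_GENE_PATTERN = re.compile(r"ENSG\d{11}")


def malformed_ids(patient):
    """Return the IDs in a patient dict that do not match the expected shape."""
    bad = [
        p for p in patient.get("positive_phenotypes", [])
        if not HPO_PATTERN.fullmatch(p)
    ]
    # Both gene fields are expected to hold Ensembl gene IDs.
    for field in ("all_candidate_genes", "true_genes"):
        bad += [
            g for g in patient.get(field, [])
            if not ENSEMBL_GENE_PATTERN.fullmatch(g)
        ]
    return bad
```

This only checks the format of each ID. It does not confirm that a term exists in the 2019 HPO release or that a gene ID is current in Ensembl.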

### Required

The minimum information required for each patient is:
- Patient ID ("id")
- List of phenotypes present in the patient as HPO terms ("positive_phenotypes")

To run causal gene discovery, the JSON must also include:
- List of all candidate genes as Ensembl IDs ("all_candidate_genes")

To run patients-like-me identification or novel disease characterization, the JSON does not require any additional information.
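
The per-task requirements above can be expressed as a small check. This is a sketch under assumptions: the task labels in `REQUIRED_FIELDS` and the function `missing_fields` are hypothetical names chosen for illustration, and only the JSON keys come from this README.

```python
# Fields every patient record needs, regardless of task.
BASE_FIELDS = ("id", "positive_phenotypes")

# Task labels are hypothetical; the field names are the JSON keys listed above.
REQUIRED_FIELDS = {
    "causal_gene_discovery": BASE_FIELDS + ("all_candidate_genes",),
    "patients_like_me": BASE_FIELDS,
    "novel_disease_characterization": BASE_FIELDS,
}


def missing_fields(patient, task):
    """Return the required keys that are absent from a patient dict for a task."""
    return [key for key in REQUIRED_FIELDS[task] if key not in patient]
```

For example, a record with only `"id"` and `"positive_phenotypes"` passes for the two tasks that need no extra information, while `missing_fields(record, "causal_gene_discovery")` reports `"all_candidate_genes"`.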

### Optional
- Causal genes ("true_genes"). *If available, please provide causal genes as Ensembl IDs.*
- Disease names ("true_diseases"). *If available, please provide true disease names as MONDO IDs.*
- OMIM ID
- Orphanet ID
- Orphanet category