# Prepare your own patient dataset

## Preprocessing steps
Your preprocessing script must do the following:
1. Map genes to Ensembl IDs
2. Map phenotypes to the 2019 version of HPO
3. Output a JSON Lines (jsonlines) file, where each line is a JSON object containing the information for a single patient

Please refer to `create_mygene2_cohort/preprocess_mygene2.py` for an example preprocessing script.
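
The three preprocessing steps can be sketched roughly as follows. This is a minimal illustration and not the logic of the example script: the raw record layout (`genes`, `phenotypes`), the lookup tables `GENE_TO_ENSEMBL` and `HPO_TO_2019`, and their placeholder keys are all assumptions made for this sketch. In practice the mappings would come from your own gene annotation and HPO resources.

```python
import io
import json

# Hypothetical lookup tables with placeholder keys (assumptions for this sketch).
GENE_TO_ENSEMBL = {"GENE_A": "ENSG00000069431", "GENE_B": "ENSG00000196277"}
HPO_TO_2019 = {"HP:OLD_TERM": "HP:0001249"}  # outdated term -> 2019 term


def preprocess_patient(raw):
    """Convert one raw record (hypothetical layout) into the patient format."""
    # Step 1: map genes to Ensembl IDs; genes without a mapping are dropped here.
    genes = [GENE_TO_ENSEMBL[g] for g in raw["genes"] if g in GENE_TO_ENSEMBL]
    # Step 2: map phenotypes to 2019 HPO terms; unmapped terms pass through unchanged.
    phenotypes = [HPO_TO_2019.get(p, p) for p in raw["phenotypes"]]
    return {
        "id": raw["id"],
        "positive_phenotypes": phenotypes,
        "all_candidate_genes": genes,
    }


def write_jsonlines(raw_patients, handle):
    """Step 3: write one JSON object per line."""
    for raw in raw_patients:
        handle.write(json.dumps(preprocess_patient(raw)) + "\n")


raw_patients = [
    {"id": 1, "genes": ["GENE_A", "UNMAPPED"], "phenotypes": ["HP:OLD_TERM", "HP:0000221"]},
]
buffer = io.StringIO()  # stands in for an open output file
write_jsonlines(raw_patients, buffer)
print(buffer.getvalue(), end="")
```

Silently dropping unmapped genes is only one possible choice; depending on your data you may prefer to log them or stop with an error.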

## Patient information

An example patient from the simulated patients dataset:

```
{
  "id": 9,
  "positive_phenotypes": ["HP:0000221", "HP:0000232", "HP:0001155", "HP:0005692", "HP:0012471", "HP:0100540", "HP:0001999", "HP:0001249", "HP:0010285", "HP:0000924", "HP:0004459"],
  "all_candidate_genes": ["ENSG00000196277", "ENSG00000104899", "ENSG00000143156", "ENSG00000088451", "ENSG00000157557", "ENSG00000165125", "ENSG00000157766", "ENSG00000108821", "ENSG00000142655", "ENSG00000184470", "ENSG00000157119", "ENSG00000069431", "ENSG00000131828", "ENSG00000179111", "ENSG00000168646"],
  "true_genes": ["ENSG00000069431"],
  "true_diseases": ["966"]
}
```

### Required

The minimum information required for each patient is:
- Patient ID ("id")
- List of phenotypes present in the patient as HPO terms ("positive_phenotypes")

To run causal gene discovery, the JSON must also include:
- List of all candidate genes as Ensembl IDs ("all_candidate_genes")

To run patients-like-me identification or novel disease characterization, the JSON does not require any additional information.
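
The per-task requirements above can be checked before running anything. The following is a small sketch rather than part of the project's code: the function name `missing_fields` and the task labels in `EXTRA_FIELDS_BY_TASK` are hypothetical names chosen for illustration, while the field names are the ones listed above.

```python
import json

# Fields every patient record needs, per the list above.
REQUIRED_FIELDS = ("id", "positive_phenotypes")

# Task labels are hypothetical names used only in this sketch.
EXTRA_FIELDS_BY_TASK = {
    "causal_gene_discovery": ("all_candidate_genes",),
    "patients_like_me": (),
    "novel_disease_characterization": (),
}


def missing_fields(patient, task):
    """Return the field names a patient record lacks for the given task."""
    needed = REQUIRED_FIELDS + EXTRA_FIELDS_BY_TASK[task]
    return [name for name in needed if name not in patient]


# One line of a jsonlines file, containing only the minimum information.
line = '{"id": 9, "positive_phenotypes": ["HP:0000221", "HP:0000232"]}'
patient = json.loads(line)

print(missing_fields(patient, "patients_like_me"))       # []
print(missing_fields(patient, "causal_gene_discovery"))  # ['all_candidate_genes']
```

Running such a check over every line of the file catches incomplete records early, before they reach a downstream task.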

### Optional
- Causal genes ("true_genes"). *If available, please provide causal genes as Ensembl IDs.*
- Disease names ("true_diseases"). *If available, please provide true disease names as MONDO IDs.*
- OMIM ID
- Orphanet ID
- Orphanet category