Assuming that you are using the SHEPHERD models that we provide:
- Set `MY_TEST_DATA` in `project_config.py`
- `--only_test_data`
- Set `MY_SPL_DATA` and `MY_SPL_INDEX_DATA` in `project_config.py`
- Run `predict.py` to generate predictions for your patients
    - `--run_type causal_gene_discovery` with `--best_ckpt checkpoints/causal_gene_discovery`
    - `--patient_data my_data`
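
The `predict.py` flags mentioned above can be combined into a single call. The following is a minimal sketch, not part of SHEPHERD: `build_predict_command` is a hypothetical helper, and it assumes that `--run_type`, `--patient_data`, and `--best_ckpt` are the only flags you need and that the command is run from the directory containing `predict.py`. Check the script's own argument help for anything else it requires.

```python
import shlex


def build_predict_command(run_type, patient_data, best_ckpt):
    """Assemble the argument list for predict.py (hypothetical helper).

    Only the three flags named in this README are used; any other
    arguments predict.py may require are not covered here.
    """
    return [
        "python", "predict.py",
        "--run_type", run_type,
        "--patient_data", patient_data,
        "--best_ckpt", best_ckpt,
    ]


cmd = build_predict_command(
    run_type="causal_gene_discovery",
    patient_data="my_data",
    best_ckpt="checkpoints/causal_gene_discovery",
)

# Print a copy-pasteable shell command.
print(shlex.join(cmd))

# To launch it from Python instead (assumption: current directory holds predict.py):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list keeps each flag next to its value, which makes it easy to swap in a different run type together with its matching checkpoint.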
The output of `predict.py` consists of:
- Dataframe of scores for each patient (`scores.csv`)
    - For causal gene discovery: Each gene in a patient's list of candidate genes is scored. The columns of the table are: patient ID, identifier of the candidate gene, similarity score, and binary correct label.
    - For patients-like-me identification: All patients in the input jsonlines file are scored. The columns of the table are: patient ID, identifier of the candidate patient, similarity score, and binary correct label. To compare only a subset of the patients, subset the scores of those patients and re-normalize.
    - For novel disease characterization: Either all diseases in the knowledge graph or all Orphanet diseases are scored. The columns of the table are: patient ID, identifier of the candidate disease (MONDO or Orphanet name), similarity score, and binary correct label.
- Phenotype attention (`phenotype_attn.csv`)
- Patient embeddings (`phenotype_embeddings.pth`)
- (Only for novel disease characterization) Disease embeddings (`disease_embeddings.pth`)
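
As an illustration of working with the scores table described above, the sketch below ranks each patient's candidates by similarity score. It rests on several assumptions: the column names (`patient_id`, `candidate_id`, `similarity`, `correct_label`) are hypothetical placeholders, so check the header of your own `scores.csv`; a higher similarity score is taken to mean a better match; and the small in-memory table is made-up data standing in for a real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; replace with the header of your scores.csv.
PATIENT_COL = "patient_id"
CANDIDATE_COL = "candidate_id"
SCORE_COL = "similarity"


def top_candidates(csv_file, k=3):
    """Return {patient ID: [(candidate, score), ...]} holding the k
    highest-scoring candidates per patient (assumes higher is better)."""
    by_patient = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_patient[row[PATIENT_COL]].append(
            (row[CANDIDATE_COL], float(row[SCORE_COL]))
        )
    return {
        patient: sorted(cands, key=lambda c: c[1], reverse=True)[:k]
        for patient, cands in by_patient.items()
    }


# Made-up rows standing in for scores.csv.
example = io.StringIO(
    "patient_id,candidate_id,similarity,correct_label\n"
    "P1,GENE_A,0.10,0\n"
    "P1,GENE_B,0.70,1\n"
    "P1,GENE_C,0.20,0\n"
    "P2,GENE_A,0.55,1\n"
    "P2,GENE_D,0.45,0\n"
)

ranked = top_candidates(example, k=2)
for patient, candidates in ranked.items():
    print(patient, candidates)
```

With a real file, pass an open file handle instead, for example `with open("scores.csv", newline="") as f: ranked = top_candidates(f)`.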