To standardize clinical trial outcome prediction, we create a benchmark dataset for Trial Outcome Prediction named TOP, which incorporates rich data components about clinical trials, including drug, disease, and protocol (eligibility criteria).
The benchmark is organized into the following parts:
- Raw Data: describes all the data sources.
  - ClinicalTrials.gov: all the clinical trial records.
  - DrugBank: molecule structures of all the drugs.
  - ClinicalTable: API for ICD-10 codes.
  - MoleculeNet: ADMET data.
- Data Curation Process: describes the data curation process.
  - Collect all the records
  - Diseases to ICD-10
  - Drug to SMILES
  - ICD-10 code hierarchy
  - Sentence embedding for trial protocol
  - Selection criteria of clinical trial
  - Data split
- Tutorial
Outcome labels are provided by IQVIA.
output
./raw_data: stores the XML files for all the trials (identified by NCT ID).
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip
Then we unzip the ZIP file. The unzipped data occupies over 8.6 GB, so please make sure you have enough disk space.
unzip AllPublicXML.zip
cd ../
We use DrugBank to get the molecule structures of the drugs as SMILES (simplified molecular-input line-entry system) strings.
input
None
output
data/drugbank_drugs_info.csv
ClinicalTable is a public API that converts disease names (natural language) into ICD-10 codes.
MoleculeNet includes five datasets across the main categories of drug pharmacokinetics (PK). For absorption, we use the bioavailability dataset. For distribution, we use the blood-brain-barrier experimental results. For metabolism, we use the CYP2C19 experiment, which is hosted in the PubChem BioAssay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database. For toxicity, we use the ToxCast dataset provided by MoleculeNet. We label a drug as not toxic only if it is not toxic in every toxicology assay; otherwise we label it as toxic.
input
None
output
data/ADMET
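The toxicity labeling rule described above can be sketched as follows. This is only an illustration: the function name, the 0/1 encoding, and the choice to ignore unmeasured assays (`None`) are assumptions, not the repository's preprocessing code.

```python
from typing import Iterable, Optional


def aggregate_toxicity(assay_results: Iterable[Optional[int]]) -> int:
    """Collapse per-assay outcomes into one binary toxicity label.

    Each entry is 1 (toxic in that assay), 0 (not toxic), or None
    (assay not measured; ignoring it is an assumption of this sketch).
    Returns 0 only if no measured assay reports toxicity, otherwise 1.
    """
    measured = [r for r in assay_results if r is not None]
    return 1 if any(r == 1 for r in measured) else 0


print(aggregate_toxicity([0, 0, None, 0]))  # 0
print(aggregate_toxicity([0, 1, 0]))        # 1
```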
Download all the records from ClinicalTrials.gov. The current version has 370K trial IDs.
input
raw_data/: raw data; stores the XML files for all the trials (identified by NCT ID).
output
data/all_xml: lists the XML files (named by NCT ID) for all the trials.
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
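Each line of data/all_xml points to one trial record. As a rough sketch of how such a record could be read, the snippet below parses a small inline XML string with the standard library. The tag names (`nct_id`, `overall_status`, `phase`, `condition`, `intervention_name`, `criteria/textblock`) are assumptions about the XML export and should be checked against a real file; the sample record itself is invented.

```python
import xml.etree.ElementTree as ET

# Invented miniature record for illustration; real files are much larger.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
  <condition>Hypertension</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>aspirin</intervention_name>
  </intervention>
  <eligibility><criteria><textblock>
    Inclusion Criteria: adults aged 18 or older.
  </textblock></criteria></eligibility>
</clinical_study>"""


def parse_trial(xml_text: str) -> dict:
    """Extract the fields used by the benchmark from one trial record."""
    root = ET.fromstring(xml_text)
    # Keep only interventions whose type is "Drug" (assumed tag layout).
    drugs = [
        node.findtext("intervention_name", default="").strip()
        for node in root.findall("intervention")
        if node.findtext("intervention_type", default="").strip() == "Drug"
    ]
    return {
        "nctid": root.findtext("id_info/nct_id", default="").strip(),
        "status": root.findtext("overall_status", default="").strip().lower(),
        "phase": root.findtext("phase", default="").strip(),
        "diseases": [c.text.strip() for c in root.findall("condition") if c.text],
        "drugs": drugs,
        "criteria": root.findtext("eligibility/criteria/textblock", default="").strip(),
    }


trial = parse_trial(SAMPLE_XML)
print(trial["nctid"], trial["diseases"], trial["drugs"])
```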
description
The diseases in ClinicalTrials.gov are described in natural language.
ICD-10, on the other hand, is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list maintained by the World Health Organization (WHO). Mapping diseases to ICD-10 codes lets us leverage the hierarchical information inherent to medical ontologies.
We use ClinicalTable, a public API that converts disease names (natural language) into ICD-10 codes.
input
raw_data/
data/all_xml
output
data/diseases.csv
It takes around 2 hours.
python benchmark/collect_disease_from_raw.py
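A minimal sketch of the disease-to-ICD-10 lookup is shown below. The endpoint URL, the query parameters, and the response layout are assumptions about the public Clinical Table Search Service and should be verified against its documentation; `disease_to_icd10` and `parse_icd10_response` are hypothetical helpers, not functions from this repository. The demonstration uses a canned response so that it runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Clinical Table Search Service; verify before use.
API_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"


def parse_icd10_response(payload: list) -> list:
    """Extract ICD-10 codes from a response assumed to look like
    [total, [code, ...], extra, [[code, name], ...]]."""
    return list(payload[1]) if len(payload) > 1 and payload[1] else []


def disease_to_icd10(disease: str, fetch=None) -> list:
    """Look up ICD-10 codes for a disease name.

    `fetch` maps a URL to the response text. It defaults to a real HTTP
    request and can be replaced by a stub for offline testing.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=30).read().decode("utf-8")
    query = urlencode({"sf": "code,name", "terms": disease})
    return parse_icd10_response(json.loads(fetch(f"{API_URL}?{query}")))


# Offline demonstration with a canned response (illustrative values).
def fake_fetch(url: str) -> str:
    return json.dumps(
        [1, ["I10"], None, [["I10", "Essential (primary) hypertension"]]]
    )


print(disease_to_icd10("hypertension", fetch=fake_fetch))  # ['I10']
```

Injecting `fetch` keeps the parsing logic testable without network access, which matters when the full run takes hours.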
SMILES (simplified molecular-input line-entry system) is a line notation that encodes the structure of a molecule as a string.
The drugs in ClinicalTrials.gov are described in natural language.
DrugBank contains rich information about drugs.
We use DrugBank to get the molecule structures in terms of SMILES.
input
data/drugbank_drugs_info.csv
output
data/drug2smiles.pkl
python benchmark/drug2smiles.py
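The name-to-SMILES mapping can be pictured as a dictionary lookup on normalized drug names. The sketch below is an illustration only: the column names `name` and `smiles`, the lowercase normalization, and the sample rows are assumptions, and the actual layout of data/drugbank_drugs_info.csv may differ.

```python
import csv
import io

# Invented sample table; the column names "name" and "smiles" are hypothetical.
SAMPLE_CSV = (
    "name,smiles\n"
    "Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O\n"
    "Ethanol,CCO\n"
    "UnknownDrug,\n"
)


def build_drug2smiles(csv_file) -> dict:
    """Map lowercase drug names to SMILES, skipping rows without a structure."""
    table = {}
    for row in csv.DictReader(csv_file):
        name, smiles = row["name"].strip().lower(), row["smiles"].strip()
        if name and smiles:
            table[name] = smiles
    return table


def lookup_smiles(drug_name: str, table: dict):
    """Return the SMILES string for a drug name, or None if it is unknown."""
    return table.get(drug_name.strip().lower())


drug2smiles = build_drug2smiles(io.StringIO(SAMPLE_CSV))
print(lookup_smiles(" ASPIRIN ", drug2smiles))
print(lookup_smiles("UnknownDrug", drug2smiles))  # None
```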
We design the following inclusion/exclusion criteria to select eligible clinical trials for learning.
inclusion criteria
- the trial has an outcome label
- disease codes are available
- drug molecules are available
exclusion criteria
- the outcome label is not available
- disease codes are not available
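The criteria above amount to a simple filter over trial records. The following sketch assumes each trial is a dictionary using the field names of the curated CSV (`label`, `icdcodes`, `smiless`); the function name and the example records are hypothetical.

```python
def is_eligible(trial: dict) -> bool:
    """Keep a trial only if it has a label, ICD-10 codes, and drug molecules."""
    has_label = trial.get("label") in (0, 1)
    has_disease_codes = bool(trial.get("icdcodes"))
    has_molecules = bool(trial.get("smiless"))
    return has_label and has_disease_codes and has_molecules


# Invented example records for illustration.
trials = [
    {"nctid": "NCT-A", "label": 1, "icdcodes": ["I10"], "smiless": ["CCO"]},
    {"nctid": "NCT-B", "label": None, "icdcodes": ["I10"], "smiless": ["CCO"]},
    {"nctid": "NCT-C", "label": 0, "icdcodes": [], "smiless": ["CCO"]},
]
selected = [t["nctid"] for t in trials if is_eligible(t)]
print(selected)  # ['NCT-A']
```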
The CSV file contains the following features:
- nctid: NCT ID, e.g., NCT00000378, NCT04439305.
- status: one of completed; terminated; active, not recruiting; withdrawn; unknown status; suspended; recruiting.
- label: 0 (failure) or 1 (success).
- phase: I, II, III or IV.
- diseases: list of diseases.
- icdcodes: list of ICD-10 codes.
- drugs: list of drug names.
- smiless: list of SMILES.
- criteria: eligibility criteria.
input
data/diseases.csv
data/drug2smiles.pkl
data/all_xml
output
data/raw_data.csv
python benchmark/collect_raw_data.py | tee data_process.log
python benchmark/nctid2date.py
'./raw_data'
output
data/raw_data.csv
output:
data/phase_I_{train/valid/test}.csv
data/phase_II_{train/valid/test}.csv
data/phase_III_{train/valid/test}.csv
python benchmark/data_split.py
Get all the ancestor codes for each ICD-10 code.
input
data/raw_data.csv
output:
data/icdcode2ancestor_dict.pkl
python benchmark/icdcode_encode.py
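One simple way to approximate the ICD-10 hierarchy is prefix truncation: each shorter prefix of a code, down to the three-character category, is treated as an ancestor. The sketch below illustrates that idea only. It is an assumption about how ancestors could be derived, and benchmark/icdcode_encode.py may use a different method.

```python
def icd10_ancestors(code: str) -> list:
    """Return ancestor codes from most specific to most general.

    The dot is dropped and ancestors are formed by truncating the code
    one character at a time, stopping at the 3-character category.
    """
    code = code.strip().upper().replace(".", "")
    return [code[:n] for n in range(len(code) - 1, 2, -1)]


print(icd10_ancestors("C50.911"))  # ['C5091', 'C509', 'C50']
print(icd10_ancestors("I10"))      # []
```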
We use BERT to compute a sentence embedding for each sentence in the clinical trial protocol.
input
data/raw_data.csv
output:
data/sentence2embedding.pkl
python benchmark/protocol_encode.py
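The output is a sentence-to-embedding dictionary, so each distinct sentence only needs to be encoded once. The sketch below shows that caching pattern with a pluggable encoder. To stay dependency-free it uses a toy hash-based encoder as a stand-in for BERT; the function names, the newline-based sentence splitting, and the lowercase normalization are all assumptions rather than the repository's implementation.

```python
import hashlib


def split_sentences(criteria: str) -> list:
    """Split eligibility criteria into normalized, non-empty lines."""
    return [line.strip().lower() for line in criteria.splitlines() if line.strip()]


def toy_encoder(sentence: str, dim: int = 8) -> list:
    """Stand-in for a BERT encoder: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_sentence2embedding(protocols, encoder=toy_encoder) -> dict:
    """Encode every distinct sentence once and cache the result."""
    table = {}
    for text in protocols:
        for sentence in split_sentences(text):
            if sentence not in table:
                table[sentence] = encoder(sentence)
    return table


table = build_sentence2embedding(
    ["Age >= 18\n\nNo prior chemotherapy", "age >= 18"]
)
print(len(table))  # 2
```

Swapping `toy_encoder` for a function that wraps a real BERT model would keep the rest of the code unchanged.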
We provide a Jupyter notebook tutorial in tutorial_benchmark.ipynb (in the main folder), which describes some key components of the data curation process.
Please contact futianfan@gmail.com for help or submit an issue. This is a joint work with Kexin Huang, Cao(Danica) Xiao, Lucas M. Glass and Jimeng Sun.
The benchmark dataset and code (including data collection and preprocessing, model construction, learning process, evaluation), referred to as the Works, are publicly available for Non-Commercial Use only at https://github.com/futianfan/clinical-trial-outcome-prediction. Non-Commercial Use is defined as academic research or other non-profit educational use which is: (1) not-for-profit; (2) not conducted or funded (unless such funding confers no commercial rights to the funding entity) by an entity engaged in the commercial use, application or exploitation of works similar to the Works; and (3) not intended to produce works for commercial use.