Benchmark

To standardize clinical trial outcome prediction, we create a benchmark dataset for Trial Outcome Prediction named TOP, which incorporates rich data components about clinical trials, including drug, disease and protocol (eligibility criteria).
The benchmark is divided into two main parts, followed by a tutorial:
- Raw Data describes all the data sources.
  - ClinicalTrials.gov: all the clinical trial records.
  - DrugBank: molecule structures of all the drugs.
  - ClinicalTable: API for ICD-10 codes.
  - MoleculeNet: ADMET data.
- Data Curation Process describes the data curation process.
  - Collect all the records
  - Disease to ICD-10 code
  - Drug to SMILES
  - Selection criteria of clinical trial
  - Data split
  - ICD-10 code hierarchy
  - Sentence embedding for trial protocol
- Tutorial

Raw Data

ClinicalTrials.gov

description
We download all clinical trial records from ClinicalTrials.gov. The processed data are based on the ClinicalTrials.gov database as of Feb 20, 2021, which contains 348,891 clinical trial records. The number grows over time as more records are added. Each record contains important information about a clinical trial, including the NCT ID (i.e., the identifier of each clinical study), disease names, drugs, brief title and summary, phase, criteria, and statistical analysis results.
Outcome labels are provided by IQVIA.
output
./raw_data: stores the XML files of all the trials (identified by NCT ID).

mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip

Then we unzip the ZIP file. The unzipped files occupy over 8.6 GB, so please make sure you have enough disk space.

unzip AllPublicXML.zip
cd ../
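
Each unzipped file describes one trial. As a rough illustration of how such a record could be read, the sketch below parses a small, made-up XML string with Python's standard library. The tag names (`id_info/nct_id`, `overall_status`, `phase`, `condition`, `intervention`, `eligibility/criteria/textblock`) are assumptions about the record layout, and `parse_trial` is a hypothetical helper, not part of the benchmark code.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed record; real files contain many more fields.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <condition>Headache</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Aspirin</intervention_name>
  </intervention>
  <eligibility>
    <criteria><textblock>Inclusion Criteria: adults</textblock></criteria>
  </eligibility>
</clinical_study>"""


def parse_trial(xml_text):
    """Extract a few fields from one trial record (tag names are assumptions)."""
    root = ET.fromstring(xml_text)
    return {
        "nctid": root.findtext("id_info/nct_id"),
        "status": root.findtext("overall_status"),
        "phase": root.findtext("phase"),
        "study_type": root.findtext("study_type"),
        "diseases": [c.text for c in root.findall("condition")],
        # Keep only interventions whose type is "Drug".
        "drugs": [
            i.findtext("intervention_name")
            for i in root.findall("intervention")
            if i.findtext("intervention_type") == "Drug"
        ],
        "criteria": root.findtext("eligibility/criteria/textblock", default=""),
    }


record = parse_trial(SAMPLE_XML)
print(record["nctid"], record["drugs"])
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`.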

DrugBank

description
We use DrugBank to get the molecule structures of the drugs as SMILES (simplified molecular-input line-entry system) strings.
input
None
output
data/drugbank_drugs_info.csv

ClinicalTable

ClinicalTable is a public API that converts disease names (natural language) into ICD-10 codes.

MoleculeNet

description
MoleculeNet includes five datasets across the main categories of drug pharmacokinetics (PK). For absorption, we use the bioavailability dataset. For distribution, we use the blood-brain-barrier experimental results. For metabolism, we use the data from the CYP2C19 experiment paper, hosted in the PubChem BioAssay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database. For toxicity, we use the ToxCast dataset provided by MoleculeNet. We label a drug as not toxic only if it is not toxic in every toxicology assay; otherwise we label it as toxic.
input
None
output
data/ADMET
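
The toxicity labeling rule described above can be sketched as follows. This is a minimal illustration: the assay names are hypothetical, and ignoring unmeasured assays (`None`) is an assumption of this sketch rather than something stated for the benchmark.

```python
def aggregate_toxicity(assay_labels):
    """Collapse per-assay labels into a single toxicity label.

    `assay_labels` maps assay name -> 1 (toxic), 0 (not toxic),
    or None (not measured). Returns 0 only if every measured assay
    is 0, returns 1 if any measured assay is 1, and returns None
    when nothing was measured.
    """
    measured = [v for v in assay_labels.values() if v is not None]
    if not measured:
        return None
    return int(any(v == 1 for v in measured))


# Hypothetical assay names, for illustration only.
print(aggregate_toxicity({"assay_a": 0, "assay_b": 0}))
print(aggregate_toxicity({"assay_a": 0, "assay_b": 1}))
```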

Data Curation Process

Collect all the records

description
List all the records downloaded from ClinicalTrials.gov. The current version has 370K trial IDs.
input
raw_data/: raw data, stores the XML files of all the trials (identified by NCT ID).
output
data/all_xml: stores the paths of the XML files of all the trials (one file per NCT ID).

find raw_data/ -name "NCT*.xml" | sort > data/all_xml

Disease to ICD-10 code

description
The diseases in ClinicalTrials.gov are described in natural language.
ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO). Mapping diseases to ICD-10 codes lets us leverage the hierarchical information inherent to medical ontologies.
We use ClinicalTable, a public API, to convert disease names (natural language) into ICD-10 codes.
input
raw_data/
data/all_xml
output
data/diseases.csv

The following command takes around 2 hours to run.

python benchmark/collect_disease_from_raw.py
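
To give a feel for the disease-name lookup, here is a hedged sketch of a single query. The endpoint URL, the `sf` and `terms` parameters, and the response layout `[total, [codes], extra, [[code, name], ...]]` are assumptions about the public ICD-10-CM search service and should be checked against its documentation. The helper names are hypothetical and are not taken from `collect_disease_from_raw.py`.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the ICD-10-CM search API.
BASE_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"


def build_query_url(disease_name):
    """Build the search URL for one disease name."""
    params = urllib.parse.urlencode({"sf": "code,name", "terms": disease_name})
    return f"{BASE_URL}?{params}"


def parse_codes(payload):
    """Extract the code list from a response assumed to look like
    [total, [codes], extra, [[code, name], ...]]."""
    if len(payload) < 2 or not payload[1]:
        return []
    return list(payload[1])


def disease_to_icd10(disease_name, timeout=10):
    """Query the API (requires network access) and return matching codes."""
    url = build_query_url(disease_name)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_codes(json.load(resp))


# Offline demonstration with a hand-written response in the assumed layout.
canned = [2, ["C34.90", "C34.91"], None, [["C34.90", "name a"], ["C34.91", "name b"]]]
print(build_query_url("lung cancer"))
print(parse_codes(canned))
```

Issuing one request per distinct disease name, and caching the results, would be consistent with the long running time of this step.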

Drug to SMILES

description
SMILES (simplified molecular-input line-entry system) is a string representation of a molecule's structure.
The drugs in ClinicalTrials.gov are described in natural language.
DrugBank contains rich information about drugs.
We use DrugBank to get the molecule structure of each drug as a SMILES string.
input
data/drugbank_drugs_info.csv
output
data/drug2smiles.pkl

python benchmark/drug2smiles.py
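
The name-to-structure mapping can be pictured as a dictionary that is then pickled. The sketch below is illustrative only: the column names `name` and `smiles` and the lower-casing of drug names are assumptions, and the real layout of data/drugbank_drugs_info.csv may differ.

```python
import csv
import io
import pickle

# Hypothetical excerpt; the real column names may differ.
SAMPLE_CSV = """name,smiles
Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O
Caffeine,CN1C=NC2=C1C(=O)N(C(=O)N2C)C
NoStructure,
"""


def build_drug2smiles(csv_file):
    """Map lower-cased drug names to SMILES, skipping rows without a structure."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        name = row["name"].strip().lower()
        smiles = row["smiles"].strip()
        if name and smiles:
            mapping[name] = smiles
    return mapping


drug2smiles = build_drug2smiles(io.StringIO(SAMPLE_CSV))

# Round trip through pickle, as a .pkl file on disk would do.
restored = pickle.loads(pickle.dumps(drug2smiles))
print(restored["aspirin"])
```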

Selection criteria of clinical trial

We design the following inclusion/exclusion criteria to select eligible clinical trials for learning.

inclusion criteria
- study-type is interventional
- intervention-type is small-molecule drug
- the trial has an outcome label
- disease codes are available
- drug molecules are available
exclusion criteria
- study-type is observational
- intervention-type is surgery, biological, or device
- outcome label is not available
- disease codes are not available
- drug molecules are not available
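
The inclusion criteria above amount to a conjunction of five checks. A minimal sketch, assuming each trial is a dict with hypothetical keys `study_type`, `intervention_type`, `label`, `icdcodes` and `smiless`, and assuming that small-molecule drugs are marked with the intervention type `"Drug"`:

```python
def is_eligible(trial):
    """Return True if a trial record satisfies all five inclusion criteria."""
    return (
        trial.get("study_type") == "Interventional"   # not observational
        and trial.get("intervention_type") == "Drug"  # assumed marker for small molecules
        and trial.get("label") in (0, 1)              # outcome label is available
        and bool(trial.get("icdcodes"))               # at least one disease code
        and bool(trial.get("smiless"))                # at least one drug molecule
    )


# Made-up records, for illustration only.
trials = [
    {"nctid": "NCT00000001", "study_type": "Interventional",
     "intervention_type": "Drug", "label": 1,
     "icdcodes": ["C34.90"], "smiless": ["CCO"]},
    {"nctid": "NCT00000002", "study_type": "Observational",
     "intervention_type": "Drug", "label": 0,
     "icdcodes": ["C34.90"], "smiless": ["CCO"]},
]
selected = [t["nctid"] for t in trials if is_eligible(t)]
print(selected)
```

Because every exclusion criterion is the negation of an inclusion criterion, checking the inclusion side alone is sufficient in this sketch.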

The output CSV file (data/raw_data.csv) contains the following features:

nctid: NCT ID, e.g., NCT00000378, NCT04439305.
status: one of completed, terminated, "active, not recruiting", withdrawn, unknown status, suspended, recruiting.
label: 0 (failure) or 1 (success).
phase: I, II, III or IV.
diseases: list of diseases.
icdcodes: list of ICD-10 codes.
drugs: list of drug names.
smiless: list of SMILES strings.
criteria: eligibility criteria.
input
data/diseases.csv
data/drug2smiles.pkl
data/all_xml
output
data/raw_data.csv

python benchmark/collect_raw_data.py | tee data_process.log

python benchmark/nctid2date.py

input
data/raw_data.csv
./raw_data
output
data/nctid_date.txt

Data Split

description (Split criteria)
phase I: phase I trials
phase II: phase II trials
phase III: phase III trials
input
data/raw_data.csv
output
data/phase_I_{train/valid/test}.csv
data/phase_II_{train/valid/test}.csv
data/phase_III_{train/valid/test}.csv

python benchmark/data_split.py
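
The per-phase split can be sketched as a group-then-cut operation. The source only states that trials are split by phase. The 70/10/20 percentages and the order-preserving cut below are assumptions of this sketch, and `split_by_phase` is a hypothetical helper rather than the logic of `data_split.py`.

```python
from collections import defaultdict


def split_by_phase(rows, percents=(70, 10, 20)):
    """Group rows by phase, then cut each group into train/valid/test.

    `percents` are integer percentages (assumed values); the test set
    takes whatever remains after the train and valid cuts.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["phase"]].append(row)

    splits = {}
    for phase, items in groups.items():
        n_train = len(items) * percents[0] // 100
        n_valid = len(items) * percents[1] // 100
        splits[phase] = {
            "train": items[:n_train],
            "valid": items[n_train:n_train + n_valid],
            "test": items[n_train + n_valid:],
        }
    return splits


# Made-up rows: ten phase I trials and five phase II trials.
rows = [{"nctid": f"NCT{i:08d}", "phase": "I"} for i in range(10)]
rows += [{"nctid": f"NCT{i:08d}", "phase": "II"} for i in range(10, 15)]
splits = split_by_phase(rows)
for phase, parts in splits.items():
    for part, items in parts.items():
        print(f"data/phase_{phase}_{part}.csv", len(items))
```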

ICD-10 code hierarchy

description
Get all the ancestor codes of each ICD-10 code.
input
data/raw_data.csv
output
data/icdcode2ancestor_dict.pkl

python benchmark/icdcode_encode.py
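
One simple way to picture the ancestor lookup is prefix truncation: dropping trailing characters of a code yields progressively coarser categories. The sketch below uses that heuristic as a stand-in. It is not necessarily how `icdcode_encode.py` works, and it does not cover block- or chapter-level ancestors (ranges of three-character categories).

```python
def icd10_ancestors(code):
    """Return ancestor codes by repeatedly dropping the last character.

    Stops at the three-character category, e.g. C34.90 -> C34.9 -> C34.
    """
    compact = code.replace(".", "")
    ancestors = []
    for end in range(len(compact) - 1, 2, -1):
        prefix = compact[:end]
        # Re-insert the dot after the three-character category.
        if len(prefix) > 3:
            prefix = prefix[:3] + "." + prefix[3:]
        ancestors.append(prefix)
    return ancestors


codes = ["C34.90", "C34"]
icdcode2ancestor = {code: icd10_ancestors(code) for code in codes}
print(icdcode2ancestor)
```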

Sentence embedding

description
Use BERT to compute an embedding for each sentence in the clinical trial protocol.
input
data/raw_data.csv
output
data/sentence2embedding.pkl

python benchmark/protocol_encode.py
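
The output is a sentence-to-vector dictionary. The sketch below shows only that caching structure: `toy_embed` is a deterministic hash-based placeholder standing in for a BERT encoder, and splitting the criteria text on line breaks is an assumption about how sentences are delimited.

```python
import hashlib


def split_sentences(criteria):
    """Split criteria text into non-empty, lower-cased lines (assumed sentence units)."""
    lines = [line.strip().lower() for line in criteria.splitlines()]
    return [line for line in lines if line]


def toy_embed(sentence, dim=8):
    """Placeholder for a BERT encoder: a deterministic hash-based vector."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def build_sentence2embedding(criteria_list, embed=toy_embed):
    """Embed each distinct sentence once and cache the result in a dict."""
    cache = {}
    for criteria in criteria_list:
        for sentence in split_sentences(criteria):
            if sentence not in cache:
                cache[sentence] = embed(sentence)
    return cache


# Made-up criteria text; the repeated sentence is embedded only once.
criteria_list = [
    "Inclusion Criteria:\n- Age over 18\n",
    "- age over 18\nExclusion Criteria:",
]
sentence2embedding = build_sentence2embedding(criteria_list)
print(len(sentence2embedding))
```

Swapping in a real encoder only requires passing a different `embed` callable. Caching by sentence avoids re-encoding text that is shared across trials.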

Tutorial

We provide a Jupyter notebook tutorial in tutorial_benchmark.ipynb (in the main folder), which describes some key components of the data curation process.

Contact

Please contact futianfan@gmail.com for help or submit an issue. This is joint work with Kexin Huang, Cao (Danica) Xiao, Lucas M. Glass and Jimeng Sun.

Benchmark Usage Agreement

The benchmark dataset and code (including data collection and preprocessing, model construction, learning process, evaluation), referred to as the Works, are publicly available for Non-Commercial Use only at https://github.com/futianfan/clinical-trial-outcome-prediction. Non-Commercial Use is defined as for academic research or other non-profit educational use which is: (1) not-for-profit; (2) not conducted or funded (unless such funding confers no commercial rights to the funding entity) by an entity engaged in the commercial use, application or exploitation of works similar to the Works; and (3) not intended to produce works for commercial use.