Diff of /benchmark/README.md [000000] .. [bc9e98]

# Benchmark

To standardize clinical trial outcome prediction, we create a benchmark dataset for Trial Outcome Prediction, named TOP, which incorporates rich data components about clinical trials, including drug, disease, and protocol (eligibility criteria).
The benchmark can be divided into two main parts:
- `Raw Data` describes all the data sources. 
  - [`ClinicalTrials.gov`](https://clinicaltrials.gov): all the clinical trial records. 
  - [`DrugBank`](https://go.drugbank.com/): molecule structures of all the drugs. 
  - [`ClinicalTable`](https://clinicaltables.nlm.nih.gov/): API for ICD-10 codes. 
  - [`MoleculeNet`](https://moleculenet.org/): ADMET data. 
- `Data Curation Process` describes the data curation process.
  - Collect all the records
  - Disease to ICD-10 code
  - Drug to SMILES
  - Selection criteria of clinical trial
  - Data split
  - ICD-10 code hierarchy
  - Sentence embedding for trial protocol
- Tutorial 

## Raw Data 

### ClinicalTrials.gov
- description
  - We download all the clinical trial records from [ClinicalTrials.gov](https://clinicaltrials.gov/AllPublicXML.zip). The processed data are based on the ClinicalTrials.gov database as of Feb 20, 2021, which contains 348,891 clinical trial records. The data size grows over time as more clinical trial records are added. Each record describes important information about a clinical trial, including the NCT ID (i.e., the identifier of each clinical study), disease names, drugs, brief title and summary, phase, criteria, and statistical analysis results. 
  - **Outcome labels** are provided by **IQVIA**. 

- output
  - `./raw_data`: stores the XML files for all the trials (identified by NCT ID).  

<!-- When the `p-value` is smaller than 0.05, we take it as positive sample. Please see `benchmark/pseudolabel.py`.  -->


```bash 
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip
```


Then we unzip the ZIP file. The unzipped files occupy over 8.6 GB, so please make sure you have enough disk space. 
```bash 
unzip AllPublicXML.zip
cd ../
```
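
To give a feel for what each unzipped record holds, here is a minimal sketch that reads the fields listed in the description from one trial. The tag names (`id_info/nct_id`, `overall_status`, `phase`, `study_type`, `condition`, `intervention`, `eligibility/criteria/textblock`) are assumptions about the XML layout, and the sample record is made up for illustration; check both against a real file in `raw_data/`.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up record that mimics the XML layout assumed by this sketch.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <condition>Hypertension</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>aspirin</intervention_name>
  </intervention>
  <eligibility><criteria><textblock>Inclusion Criteria: adults.</textblock></criteria></eligibility>
</clinical_study>"""


def parse_trial(xml_text):
    """Extract a few fields from one trial record (tag names are assumptions)."""
    root = ET.fromstring(xml_text)
    return {
        "nctid": root.findtext("id_info/nct_id"),
        "status": root.findtext("overall_status"),
        "phase": root.findtext("phase"),
        "study_type": root.findtext("study_type"),
        "diseases": [c.text for c in root.findall("condition")],
        # Keep only interventions tagged as drugs.
        "drugs": [
            i.findtext("intervention_name")
            for i in root.findall("intervention")
            if i.findtext("intervention_type") == "Drug"
        ],
        "criteria": root.findtext("eligibility/criteria/textblock", default=""),
    }


trial = parse_trial(SAMPLE_XML)
print(trial["nctid"], trial["phase"], trial["drugs"])
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.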

### DrugBank

- description
  - We use [DrugBank](https://go.drugbank.com/) to get the molecule structures of the drugs as [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) (simplified molecular-input line-entry system) strings. 

- input
  - None 

- output
  - `data/drugbank_drugs_info.csv`  

### ClinicalTable

[ClinicalTable](https://clinicaltables.nlm.nih.gov/) is a public API that converts disease names (natural language) into ICD-10 codes. 

### MoleculeNet
- description
  - [MoleculeNet](https://moleculenet.org/) includes five datasets across the main categories of drug pharmacokinetics (PK). For absorption, we use the bioavailability dataset. For distribution, we use the blood-brain-barrier experimental results provided. For metabolism, we use the data from the CYP2C19 experiment paper, hosted in the PubChem BioAssay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database. For toxicity, we use the ToxCast dataset, provided by MoleculeNet. We consider a drug non-toxic only if it is non-toxic in every toxicology assay; otherwise we consider it toxic. 

- input
  - None 

- output 
  - `data/ADMET`
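
The toxicity rule in the description (non-toxic only if every assay is non-toxic) can be sketched as below. The function name and the handling of missing measurements (`None` is skipped) are assumptions made for illustration, not taken from the benchmark code.

```python
def is_toxic(assay_labels):
    """Return 1 if any assay flags the drug as toxic, else 0.

    `assay_labels` holds one 0/1 label per toxicology assay. None marks an
    assay without a measurement and is skipped (an assumption of this sketch).
    """
    measured = [label for label in assay_labels if label is not None]
    return int(any(label == 1 for label in measured))


print(is_toxic([0, 0, None, 0]))  # non-toxic in every measured assay
print(is_toxic([0, 1, None, 0]))  # toxic in at least one assay
```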

---

## Data Curation Process 

### Collect all the records
- description
  - Collect all the records downloaded from ClinicalTrials.gov. The current version has 370K trial IDs. 

- input
  - `raw_data/`: raw data, stores the XML files for all the trials (identified by NCT ID).   

- output
  - `data/all_xml`: stores the paths of the XML files for all the trials (one file per NCT ID).  

```bash
# Quote the pattern so the shell does not expand it before find sees it.
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
```


### Disease to ICD-10 code

- description

  - The diseases in [ClinicalTrials.gov](https://clinicaltrials.gov) are described in natural language. 

  - [ICD-10](https://en.wikipedia.org/wiki/ICD-10) is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO). Its codes are organized hierarchically, which lets us leverage the hierarchical information inherent to medical ontologies. 

  - We use [ClinicalTable](https://clinicaltables.nlm.nih.gov/), a public API, to convert disease names (natural language) into ICD-10 codes. 

- input 
  - `raw_data/` 
  - `data/all_xml`   

- output
  - `data/diseases.csv` 

It takes around 2 hours. 

```bash 
python benchmark/collect_disease_from_raw.py
```
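
For orientation, a single lookup against the ClinicalTable API might look like the sketch below. The endpoint path, the `sf` and `terms` query parameters, and the response shape `[total, [codes], extra, [[code, name], ...]]` are assumptions based on the public API and should be checked against its documentation; the helper names are hypothetical and are not the ones used in `collect_disease_from_raw.py`. The demonstration at the bottom uses a hand-written payload, so it runs without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed ICD-10-CM search endpoint of the ClinicalTable API.
API_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"


def build_query_url(disease_name):
    """Build the search URL for one disease name."""
    query = urllib.parse.urlencode({"sf": "code,name", "terms": disease_name})
    return f"{API_URL}?{query}"


def extract_codes(payload):
    """Pull the ICD-10 codes out of a response of the assumed shape."""
    return list(payload[1])


def disease_to_icd10(disease_name, timeout=10):
    """Query the API over the network and return the matching codes."""
    with urllib.request.urlopen(build_query_url(disease_name), timeout=timeout) as resp:
        return extract_codes(json.load(resp))


# Offline demonstration with a hand-written payload of the assumed shape.
sample = [
    2,
    ["A15.0", "A15.4"],
    None,
    [["A15.0", "Tuberculosis of lung"],
     ["A15.4", "Tuberculosis of intrathoracic lymph nodes"]],
]
print(build_query_url("tuberculosis"))
print(extract_codes(sample))
```

Issuing one request per disease name would be consistent with the long runtime noted above, so caching results per distinct name is worth considering.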



### Drug to SMILES 

- description
  - [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) (simplified molecular-input line-entry system) is a line notation that encodes the structure of a molecule as a text string. 

  - The drugs in [ClinicalTrials.gov](https://clinicaltrials.gov) are described in natural language. 

  - [DrugBank](https://go.drugbank.com/) contains rich information about drugs. 

  - We use [DrugBank](https://go.drugbank.com/) to get the molecule structures in terms of SMILES. 

- input
  - `data/drugbank_drugs_info.csv`  

- output
  - `data/drug2smiles.pkl`  

```bash
python benchmark/drug2smiles.py 
```
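
As a rough illustration of this step, the sketch below builds a name-to-SMILES dictionary from a CSV and pickles it. The column names `name` and `smiles`, the lower-casing of drug names, and the skipping of rows without a structure are all assumptions; the actual layout of `data/drugbank_drugs_info.csv` and the logic of `drug2smiles.py` may differ.

```python
import csv
import io
import pickle

# Hypothetical two-column layout; the real CSV may use different column names.
SAMPLE_CSV = """name,smiles
Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O
Ethanol,CCO
NoStructure,
"""


def build_drug2smiles(csv_file):
    """Map lower-cased drug names to SMILES, skipping rows without a structure."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        name, smiles = row["name"].strip().lower(), row["smiles"].strip()
        if name and smiles:
            mapping[name] = smiles
    return mapping


drug2smiles = build_drug2smiles(io.StringIO(SAMPLE_CSV))
blob = pickle.dumps(drug2smiles)  # the bytes that would be written to a .pkl file
print(pickle.loads(blob).get("aspirin"))
```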



### Selection criteria of clinical trial

We design the following inclusion/exclusion criteria to select eligible clinical trials for learning. 

- inclusion criteria 
  - study-type is interventional 
  - intervention-type is small-molecule drug
  - outcome label is available
  <!-- - p-value in primary-outcome is available -->
  - disease codes are available 
  - drug molecules are available 
  <!-- - eligibility criteria are available -->


- exclusion criteria 
  - study-type is observational 
  - intervention-type is surgery, biological, or device
  - outcome label is not available 
  <!-- - p-value in primary-outcome is not available -->
  - disease codes are not available 
  - drug molecules are not available 
  <!-- - eligibility criteria are not available -->
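
The criteria above amount to a simple per-trial filter. A minimal sketch, assuming each trial is a dict: the key `intervention_types` and the string values `"Interventional"` and `"Drug"` are hypothetical, and the sketch does not decide how trials that combine a drug with a device or biological should be treated.

```python
def is_eligible(trial):
    """Apply the inclusion criteria to one trial given as a dict."""
    return (
        trial.get("study_type") == "Interventional"
        and "Drug" in trial.get("intervention_types", [])
        and trial.get("label") in (0, 1)   # an outcome label is available
        and bool(trial.get("icdcodes"))    # at least one disease code
        and bool(trial.get("smiless"))     # at least one drug molecule
    )


ok = {
    "study_type": "Interventional",
    "intervention_types": ["Drug"],
    "label": 1,
    "icdcodes": ["C34.90"],
    "smiless": ["CCO"],
}
observational = dict(ok, study_type="Observational")
unlabeled = dict(ok, label=None)
print(is_eligible(ok), is_eligible(observational), is_eligible(unlabeled))
```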

The CSV file contains the following features:

* `nctid`: NCT ID, e.g., NCT00000378, NCT04439305. 
* `status`: `completed`, `terminated`, `active, not recruiting`, `withdrawn`, `unknown status`, `suspended`, `recruiting`. 
<!-- * `why_stop`: for completed, it is empty. Otherwise, the common reasons contain `slow/low/poor accrual`, `lack of efficacy` -->
* `label`: 0 (failure) or 1 (success).  
* `phase`: I, II, III or IV. 
* `diseases`: list of diseases. 
* `icdcodes`: list of ICD-10 codes.
* `drugs`: list of drug names.
* `smiless`: list of SMILES strings.
* `criteria`: eligibility criteria. 

- input    
  - `data/diseases.csv` 
  - `data/drug2smiles.pkl`  
  - `data/all_xml` 

- output 
  - `data/raw_data.csv` 


```bash
python benchmark/collect_raw_data.py | tee data_process.log 
```

Then we collect the date of each trial in `data/raw_data.csv`:

```bash
python benchmark/nctid2date.py 
```

- input
  - `data/raw_data.csv`
  - `./raw_data`

- output 
  - `data/nctid_date.txt`


<!-- <p align="center"><img src="./dataset.png" alt="logo" width="650px" /></p> -->



### Data Split 
- description (Split criteria)
  - phase I: phase I trials
  - phase II: phase II trials
  - phase III: phase III trials
- input
  - `data/raw_data.csv` 

- output: 
  - `data/phase_I_{train/valid/test}.csv` 
  - `data/phase_II_{train/valid/test}.csv` 
  - `data/phase_III_{train/valid/test}.csv` 


```bash
python benchmark/data_split.py 
```
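
This section states that trials are grouped by phase but not how each group is divided into train, valid, and test. The sketch below assumes a chronological split (earlier trials for training, later ones for testing) with hypothetical 70/10/20 proportions; both choices are assumptions and may not match `data_split.py`.

```python
from collections import defaultdict


def split_by_phase(trials, train_pct=70, valid_pct=10):
    """Group trials by phase, then split each group chronologically.

    `trials` is a list of dicts with `nctid`, `phase`, and an ISO `date` string.
    The chronological order and the default percentages are assumptions.
    """
    by_phase = defaultdict(list)
    for trial in trials:
        by_phase[trial["phase"]].append(trial)

    splits = {}
    for phase, group in by_phase.items():
        group.sort(key=lambda t: t["date"])  # ISO dates sort correctly as strings
        n_train = len(group) * train_pct // 100
        n_valid = len(group) * valid_pct // 100
        splits[phase] = {
            "train": group[:n_train],
            "valid": group[n_train:n_train + n_valid],
            "test": group[n_train + n_valid:],
        }
    return splits


# Ten made-up phase I trials, deliberately listed newest first.
trials = [
    {"nctid": f"NCT{i:08d}", "phase": "I", "date": f"20{10 + i}-01-01"}
    for i in reversed(range(10))
]
splits = split_by_phase(trials)
print({name: len(part) for name, part in splits["I"].items()})
```

Splitting by date rather than at random keeps later trials out of the training set, which is one way to avoid evaluating on trials that predate the training data.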


### ICD-10 code hierarchy 

- description 
  - Get all the ancestor codes of each ICD-10 code. 

- input
  - `data/raw_data.csv` 

- output: 
  - `data/icdcode2ancestor_dict.pkl` 


```bash 
python benchmark/icdcode_encode.py 
```
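
One simple way to derive ancestors is to read them off the code itself by dropping trailing characters. This is only a sketch under that assumption: it stops at the 3-character category, ignores chapter- and block-level ancestors, and is not necessarily how `icdcode_encode.py` builds its dictionary.

```python
def icd10_ancestors(code):
    """List ancestor codes from the closest parent up to the 3-character category.

    Assumes the hierarchy can be read off the code by dropping trailing
    characters, e.g. C34.90 -> C34.9 -> C34.
    """
    compact = code.replace(".", "").upper()
    ancestors = []
    for length in range(len(compact) - 1, 2, -1):
        prefix = compact[:length]
        # Re-insert the dot after the 3-character category when needed.
        ancestors.append(prefix if length == 3 else f"{prefix[:3]}.{prefix[3:]}")
    return ancestors


icdcode2ancestor = {code: icd10_ancestors(code) for code in ["C34.90", "I10"]}
print(icdcode2ancestor)
```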

### Sentence embedding 

- description 
  - Use BERT to compute an embedding for each sentence in the clinical trial protocol. 

- input
  - `data/raw_data.csv` 

- output: 
  - `data/sentence2embedding.pkl` 


```bash 
python benchmark/protocol_encode.py 
```
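
The output name suggests a dictionary from sentence to embedding. A minimal sketch of that caching pattern follows, assuming sentences are obtained by splitting the criteria text on line breaks and lower-casing them. `toy_embed` is a stand-in so the example runs without a model; in practice it would be replaced by a BERT encoder. None of these helper names come from `protocol_encode.py`.

```python
def split_protocol(criteria_text):
    """Split free-text eligibility criteria into cleaned, non-empty lines."""
    lines = (line.strip().lower() for line in criteria_text.split("\n"))
    return [line for line in lines if line]


def build_sentence2embedding(protocols, embed):
    """Embed each distinct sentence once and cache the result in a dict."""
    cache = {}
    for text in protocols:
        for sentence in split_protocol(text):
            if sentence not in cache:
                cache[sentence] = embed(sentence)
    return cache


def toy_embed(sentence):
    """Stand-in for a BERT encoder: returns a tiny hand-made feature vector."""
    return [len(sentence), sentence.count(" ") + 1]


protocols = [
    "Inclusion Criteria:\n- Age over 18\n\nExclusion Criteria:\n- Pregnant",
    "Inclusion Criteria:\n- Age over 18",
]
sentence2embedding = build_sentence2embedding(protocols, toy_embed)
print(len(sentence2embedding))
```

Because many trials share boilerplate sentences, embedding each distinct sentence only once keeps the number of encoder calls well below the total sentence count.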



## Tutorial 

We provide a Jupyter notebook tutorial in `tutorial_benchmark.ipynb` (in the main folder), which describes some key components of the data curation process. 



## Contact

Please contact futianfan@gmail.com for help or submit an issue. This is joint work with [Kexin Huang](https://www.kexinhuang.com/), [Cao (Danica) Xiao](https://sites.google.com/view/danicaxiao/), Lucas M. Glass, and [Jimeng Sun](http://sunlab.org/). 



## Benchmark Usage Agreement

The benchmark dataset and code (including data collection and preprocessing, model construction, learning process, evaluation), referred to as the Works, are publicly available for Non-Commercial Use only at https://github.com/futianfan/clinical-trial-outcome-prediction. Non-Commercial Use is defined as for academic research or other non-profit educational use which is: (1) not-for-profit; (2) not conducted or funded (unless such funding confers no commercial rights to the funding entity) by an entity engaged in the commercial use, application or exploitation of works similar to the Works; and (3) not intended to produce works for commercial use.