# Tutorial: Benchmark

In this repository, we curate the trial approval prediction (TAP) benchmark dataset. For ease of understanding, we illustrate the key steps of the data curation process. We suggest running the whole process from the command line, since it is time- and space-consuming.

Agenda:

<!-- - Raw data
  - **clinicaltrials.gov** contains all the clinical trial records
  - **DrugBank** contains molecules information for drugs
  - **MoleculeNet** contains ADMET data
  - **clinicaltables.nlm.nih.gov** converts disease into ICD-10 code
- Data curation process  -->

- Collect all the trial records
- Read XML file
- Diseases to ICD-10 code
- Drug to molecules
- Sentence embedding for the trial protocol



Let's start!

## Collect all the trial records

We collect all the trial records from **ClinicalTrials.gov**. Each trial is stored as an XML file and is identified by its NCT ID.


### (i) Download data

```bash 
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip
```


### (ii) Unzip the ZIP file
The unzipped files occupy over 8.6 GB. Please make sure you have enough disk space.
```bash 
unzip AllPublicXML.zip
cd ../
```

### (iii) Collect all the XML files
```bash
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
head -3 data/all_xml
```


```
raw_data/NCT0000xxxx/NCT00000102.xml
raw_data/NCT0000xxxx/NCT00000104.xml
raw_data/NCT0000xxxx/NCT00000105.xml
```
NCTID is the identifier of a clinical trial. `NCT00000102`, `NCT00000104`, `NCT00000105` are all NCTIDs. 
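
As a small illustration, the NCTID can be recovered from each path listed in `data/all_xml`. The sketch below assumes the path layout shown above; the helper name `path_to_nctid` is hypothetical and not part of the repository.

```python
import os

def path_to_nctid(xml_path):
    """Return the NCTID encoded in an XML file path (hypothetical helper)."""
    # Drop the directory part and any trailing newline, then the '.xml' extension.
    filename = os.path.basename(xml_path.strip())
    return os.path.splitext(filename)[0]

# Two of the paths printed by `head -3 data/all_xml` above.
paths = [
    "raw_data/NCT0000xxxx/NCT00000102.xml",
    "raw_data/NCT0000xxxx/NCT00000104.xml",
]
nctids = [path_to_nctid(p) for p in paths]
print(nctids)
```

The same helper could be applied line by line when reading `data/all_xml`.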



## Read XML file

We parse each XML file to obtain the features, including disease names, drug molecules, brief title and summary, phase, and eligibility criteria (i.e., the protocol).


```python
from xml.etree import ElementTree as ET

def xmlfile2results(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    nctid = root.find('id_info').find('nct_id').text  # e.g., 'NCT00000102'
    print("nctid is", nctid)
    study_type = root.find('study_type').text
    print("study type is", study_type)
    interventions = root.findall('intervention')
    drug_interventions = [i.find('intervention_name').text for i in interventions
                          if i.find('intervention_type').text == 'Drug']
    print("drug intervention:", drug_interventions)
    # Skip trials without any drug intervention
    # (e.g., biologics or non-interventional studies).
    if len(drug_interventions) == 0:
        return (None,)

    # The fields below are optional: root.find() returns None when the
    # element is missing, so accessing .text raises AttributeError.
    try:
        status = root.find('overall_status').text
        print("status:", status)
    except AttributeError:
        status = ''

    try:
        why_stop = root.find('why_stopped').text
        print("why stop:", why_stop)
    except AttributeError:
        why_stop = ''

    try:
        phase = root.find('phase').text
        print("phase:", phase)
    except AttributeError:
        phase = ''

    conditions = [i.text for i in root.findall('condition')]  # disease names
    print("disease", conditions)

xmlfile = "data/NCT00000378.xml"
xmlfile2results(xmlfile)
```

```
nctid is NCT00000378
study type is Interventional
drug intervention: ['Sertraline', 'Nortriptyline']
status: Completed
phase: Phase 4
disease ['Depression', 'Melancholia']
```
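
The parsing code above does not show how the eligibility criteria are read. The following is a minimal sketch, assuming the criteria text sits under `eligibility/criteria/textblock` in the XML record. The XML fragment is hand-written for illustration and is not a real trial record, and `get_criteria` is a hypothetical helper name.

```python
from xml.etree import ElementTree as ET

# Hand-written fragment for illustration only (not a real trial record).
SAMPLE_XML = """<clinical_study>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
        - Age 18 or older
        Exclusion Criteria:
        - Pregnancy
      </textblock>
    </criteria>
  </eligibility>
</clinical_study>"""

def get_criteria(root):
    """Return the eligibility criteria text, or '' if the element is absent."""
    # Assumed element path; findtext() returns the default when nothing matches.
    text = root.findtext('eligibility/criteria/textblock', default='')
    return text.strip()

root = ET.fromstring(SAMPLE_XML)
criteria = get_criteria(root)
print(criteria)
```

Using `findtext` with a default avoids a `try`/`except` around a missing element.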


## Diseases to ICD-10 code

**clinicaltables.nlm.nih.gov** provides a public API that converts a disease name into ICD-10 codes.

We illustrate it with a simple example.



```python
import requests

def get_icd_from_nih(disease_name):
    url = 'https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search'
    # Passing the query via `params` lets requests URL-encode the disease name.
    params = {'sf': 'code,name', 'terms': disease_name}
    response = requests.get(url, params=params)
    response.raise_for_status()
    # The response is a JSON array: [total_count, [codes], extra_data, [[code, name], ...]].
    total, codes = response.json()[:2]
    if total == 0:
        return None
    return codes

disease_name = "lung neoplasm"
print(get_icd_from_nih(disease_name))
```

```
['C78.00', 'C78.01', 'C78.02', 'D14.30', 'D14.31', 'D14.32', 'C34.2']
```
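
Many trials share the same condition names, so repeated API calls can be avoided by caching results. The sketch below is one possible way to do this and is not taken from the repository: `conditions_to_icd` is a hypothetical helper that accepts any lookup function (for example `get_icd_from_nih`), and a stub with placeholder codes stands in for the API so the example runs offline.

```python
def conditions_to_icd(conditions, lookup, cache=None):
    """Map each condition name to its ICD-10 codes, querying each name once."""
    cache = {} if cache is None else cache
    result = {}
    for name in conditions:
        key = name.strip()
        if key not in cache:
            codes = lookup(key)
            # Treat "no match" (None) as an empty list of codes.
            cache[key] = codes if codes is not None else []
        result[name] = cache[key]
    return result

# Stub standing in for the real API call; the codes are placeholders, not real ICD-10 codes.
calls = []
def fake_lookup(name):
    calls.append(name)
    placeholder = {"Depression": ["PLACEHOLDER-1", "PLACEHOLDER-2"]}
    return placeholder.get(name)

mapping = conditions_to_icd(["Depression", "Melancholia", "Depression"], fake_lookup)
print(mapping)
print(len(calls))  # number of lookups actually performed
```

Passing a shared `cache` dictionary across calls would extend the saving to a whole batch of trials.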


## Drug to molecules

Molecules are represented as SMILES strings; SMILES is a line notation for encoding molecular structure. Drug names are extracted from ClinicalTrials.gov and linked to their molecular structures (SMILES strings) using the [DrugBank database](https://www.drugbank.com).



```python
import pandas as pd

drugbank_file = "data/drugbank_mini.csv"  # mini version of the DrugBank table
df = pd.read_csv(drugbank_file)
df[['title', 'moldb_smiles']]
```


|   | title | moldb_smiles |
|---|-------|--------------|
| 0 | Cytarabine | `NC1=NC(=O)N(C=C1)[C@@H]1O[C@H](CO)[C@@H](O)[C@...` |
| 1 | aspirin | `CC(=O)OC1=CC=CC=C1C(O)=O` |
| 2 | Buprenorphine | `CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...` |
| 3 | Buprenorphine | `CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...` |
| 4 | Cyclophosphamide | `ClCCN(CCCl)P1(=O)NCCCO1` |
| 5 | Zidovudine | `CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...` |
| 6 | Lamivudine | `NC1=NC(=O)N(C=C1)[C@@H]1CS[C@H](CO)O1` |
| 7 | Zidovudine | `CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...` |
| 8 | cyclophosphamide | `ClCCN(CCCl)P1(=O)NCCCO1` |
| 9 | Zidovudine | `CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...` |
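
The table contains repeated drugs and inconsistent capitalization (for example `Cyclophosphamide` and `cyclophosphamide`). One way to link a drug name to its SMILES string is a case-insensitive dictionary, sketched below. This is an illustrative assumption rather than the repository's actual linking code; `build_name2smiles` is a hypothetical helper, and the inline CSV reuses three rows from the table above.

```python
import csv
import io

# Inline sample with the same columns as the table above.
SAMPLE_CSV = """title,moldb_smiles
aspirin,CC(=O)OC1=CC=CC=C1C(O)=O
Cyclophosphamide,ClCCN(CCCl)P1(=O)NCCCO1
cyclophosphamide,ClCCN(CCCl)P1(=O)NCCCO1
"""

def build_name2smiles(csv_file):
    """Build a lowercase drug name -> SMILES dict, keeping the first entry per name."""
    name2smiles = {}
    for row in csv.DictReader(csv_file):
        name = row['title'].strip().lower()
        # setdefault keeps the first SMILES seen and ignores later duplicates.
        name2smiles.setdefault(name, row['moldb_smiles'])
    return name2smiles

name2smiles = build_name2smiles(io.StringIO(SAMPLE_CSV))
print(name2smiles.get('aspirin'))
print(name2smiles.get('Cyclophosphamide'.lower()))
```

Drug names taken from a trial record would be lowercased the same way before the lookup.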


## Sentence embedding 


To convert the criteria sentences into embedding vectors, we apply BERT (Bidirectional Encoder Representations from Transformers), a pretraining technique that captures language semantics and exhibits state-of-the-art performance on various NLP tasks.
Clinical-BERT is a domain-specific version of BERT.
First, install the `biobert-embedding` package.
```bash
pip install biobert-embedding 
```

Note that `biobert-embedding` requires `torch<=1.2.0`, which may be incompatible with your current environment. We therefore suggest creating a separate conda environment for computing the sentence embeddings.
Each sentence in the eligibility criteria is converted into a 768-dimensional vector.

```python
import warnings
warnings.filterwarnings("ignore")

from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()
sentence = "Patients must have cognitive dysfunction on neuropsychological testing."
embedding = biobert.sentence_vector(sentence)
embedding.shape
```

```
torch.Size([768])
```
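
Before embedding, the eligibility criteria text has to be broken into individual sentences. The sketch below shows one simple way to do this, assuming each criterion occupies its own line and may start with a bullet character. The repository's actual splitting rules may differ, and `criteria_to_sentences` is a hypothetical helper name.

```python
def criteria_to_sentences(criteria_text):
    """Split a criteria text block into cleaned, non-empty lines."""
    sentences = []
    for line in criteria_text.split('\n'):
        # Remove surrounding whitespace and a leading bullet marker ('-' or '*').
        line = line.strip().lstrip('-*').strip()
        if line:
            sentences.append(line)
    return sentences

criteria_text = """
Inclusion Criteria:

  - Patients must have cognitive dysfunction on neuropsychological testing.
  - Age 18 or older
"""
sentences = criteria_to_sentences(criteria_text)
for s in sentences:
    print(s)
```

Each resulting sentence could then be passed to `biobert.sentence_vector` as in the example above.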