[221dc3]: / tutorial_benchmark.ipynb

Download this file

376 lines (375 with data), 12.1 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial: Benchmark\n",
    "\n",
    "In this repository, we curate trial approval prediction (TAP) benchmark dataset. For ease of understanding, we illutrate some key steps in data curation process. We suggest users to run whole process in command line since it is time-/space consuming. \n",
    "\n",
    "Agenda:\n",
    "\n",
    "<!-- - Raw data\n",
    "  - **clinicaltrials.gov** contains all the clinical trial records\n",
    "  - **DrugBank** contains molecules information for drugs\n",
    "  - **MoleculeNet** contains ADMET data\n",
    "  - **clinicaltables.nlm.nih.gov** converts disease into ICD-10 code\n",
    "- Data curation process  -->\n",
    "\n",
    "- Collect all the trial records\n",
    "- read XML file \n",
    "- Diseases to icd10 \n",
    "- Drug to molecules \n",
    "- Sentence Embedding for trial protocol \n",
    "\n",
    "\n",
    "\n",
    "Let's start!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Collect all the trial records\n",
    "\n",
    "We collect all the trial records from **ClinicalTrials.gov**. Each trial is an XML file, NCT ID is identifiers to each clinical trial. \n",
    "\n",
    "\n",
    "### (i) Download data\n",
    "\n",
    "```bash \n",
    "mkdir -p raw_data\n",
    "cd raw_data\n",
    "wget https://clinicaltrials.gov/AllPublicXML.zip\n",
    "```\n",
    "\n",
    "\n",
    "### (ii) Unzip the ZIP file.\n",
    "The unzipped file occupies over 8.6 G. Please make sure you have enough space. \n",
    "```bash \n",
    "unzip AllPublicXML.zip\n",
    "cd ../\n",
    "```\n",
    "\n",
    "### (iii) Collect all the XML file\n",
    "```bash\n",
    "find raw_data/ -name NCT*.xml | sort > data/all_xml\n",
    "head -3 data/all_xml\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "raw_data/NCT0000xxxx/NCT00000102.xml\n",
    "raw_data/NCT0000xxxx/NCT00000104.xml\n",
    "raw_data/NCT0000xxxx/NCT00000105.xml\n",
    "```\n",
    "NCTID is the identifier of a clinical trial. `NCT00000102`, `NCT00000104`, `NCT00000105` are all NCTIDs. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## read XML file\n",
    "\n",
    "We parse xml file to obtain all the features, including disease names, drug molecules, brief title and summary, phase, eligibility criteria (i.e., protocol).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nctid is NCT00000378\n",
      "study type is Interventional\n",
      "drug intervention: ['Sertraline', 'Nortriptyline']\n",
      "status: Completed\n",
      "phase: Phase 4\n",
      "disease ['Depression', 'Melancholia']\n"
     ]
    }
   ],
   "source": [
    "from xml.etree import ElementTree as ET\n",
    "def xmlfile2results(xml_file):\n",
    "    tree = ET.parse(xml_file)\n",
    "    root = tree.getroot()\n",
    "    nctid = root.find('id_info').find('nct_id').text\t### nctid: 'NCT00000102'\n",
    "    print(\"nctid is\", nctid)\n",
    "    study_type = root.find('study_type').text\n",
    "    print(\"study type is\", study_type)\n",
    "    interventions = [i for i in root.findall('intervention')]\n",
    "    drug_interventions = [i.find('intervention_name').text for i in interventions \\\n",
    "\t\t\t\t\t\t\t\t\t\t\t\t\t\tif i.find('intervention_type').text=='Drug']\n",
    "    print(\"drug intervention:\", drug_interventions)\n",
    "    ### remove 'biologics', \n",
    "    ### non-interventions \n",
    "    if len(drug_interventions)==0:\n",
    "        return (None,)\n",
    "\n",
    "    try:\n",
    "        status = root.find('overall_status').text \n",
    "        print(\"status:\", status)\n",
    "    except:\n",
    "        status = ''\n",
    "\n",
    "    try:\n",
    "        why_stop = root.find('why_stopped').text\n",
    "        print(\"why stop:\", why_stop)\n",
    "    except:\n",
    "        why_stop = ''\n",
    "\n",
    "    try:\n",
    "        phase = root.find('phase').text\n",
    "        print(\"phase:\", phase)\n",
    "    except:\n",
    "        phase = ''\n",
    "    conditions = [i.text for i in root.findall('condition')] ### disease \n",
    "    print(\"disease\", conditions)\n",
    "    \n",
    "xmlfile = \"data/NCT00000378.xml\"\n",
    "xmlfile2results(xmlfile)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## diseases to icd10 code\n",
    "\n",
    "**clinicaltables.nlm.nih.gov** is a public API to convert disease into ICD-10 code\n",
    "\n",
    "In order to illustrate it, we show a simple example as follows. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['C78.00', 'C78.01', 'C78.02', 'D14.30', 'D14.31', 'D14.32', 'C34.2']\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "def get_icd_from_nih(disease_name):\n",
    "\tprefix = 'https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search?sf=code,name&terms='\n",
    "\turl = prefix + disease_name \n",
    "\tresponse = requests.get(url)\n",
    "\ttext = response.text \n",
    "\tif text == '[0,[],null,[]]':\n",
    "\t\treturn None  \n",
    "\ttext = text[1:-1]\n",
    "\tidx1 = text.find('[')\n",
    "\tidx2 = text.find(']')\n",
    "\tcodes = text[idx1+1:idx2].split(',')\n",
    "\tcodes = [i[1:-1] for i in codes]\n",
    "\treturn codes \n",
    "\n",
    "disease_name = \"lung neoplasm\"\n",
    "print(get_icd_from_nih(disease_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## drug to molecules \n",
    "\n",
    "Molecules is represented in SMILES string, SMILES is a line notation for encoding molecular structure. Drug molecule data are extracted from ClinicalTrials.gov and linked to its molecule structure (SMILES strings) using [DrugBank Database](drugbank.com). \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>moldb_smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Cytarabine</td>\n",
       "      <td>NC1=NC(=O)N(C=C1)[C@@H]1O[C@H](CO)[C@@H](O)[C@...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>aspirin</td>\n",
       "      <td>CC(=O)OC1=CC=CC=C1C(O)=O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Buprenorphine</td>\n",
       "      <td>CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Buprenorphine</td>\n",
       "      <td>CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Cyclophosphamide</td>\n",
       "      <td>ClCCN(CCCl)P1(=O)NCCCO1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Zidovudine</td>\n",
       "      <td>CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Lamivudine</td>\n",
       "      <td>NC1=NC(=O)N(C=C1)[C@@H]1CS[C@H](CO)O1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Zidovudine</td>\n",
       "      <td>CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>cyclophosphamide</td>\n",
       "      <td>ClCCN(CCCl)P1(=O)NCCCO1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Zidovudine</td>\n",
       "      <td>CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              title                                       moldb_smiles\n",
       "0        Cytarabine  NC1=NC(=O)N(C=C1)[C@@H]1O[C@H](CO)[C@@H](O)[C@...\n",
       "1           aspirin                           CC(=O)OC1=CC=CC=C1C(O)=O\n",
       "2     Buprenorphine  CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...\n",
       "3     Buprenorphine  CO[C@]12CC[C@@]3(C[C@@H]1[C@](C)(O)C(C)(C)C)[C...\n",
       "4  Cyclophosphamide                            ClCCN(CCCl)P1(=O)NCCCO1\n",
       "5        Zidovudine  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...\n",
       "6        Lamivudine              NC1=NC(=O)N(C=C1)[C@@H]1CS[C@H](CO)O1\n",
       "7        Zidovudine  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...\n",
       "8  cyclophosphamide                            ClCCN(CCCl)P1(=O)NCCCO1\n",
       "9        Zidovudine  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C..."
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import csv\n",
    "import pandas as pd\n",
    "drugbank_file = \"data/drugbank_mini.csv\" #### mini version\n",
    "df = pd.read_csv(drugbank_file)\n",
    "df[['title', 'moldb_smiles']]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentence embedding \n",
    "\n",
    "\n",
    "To convert the criteria sentences into embedding vectors, we apply BERT (Bidirectional Encoder Representations from Transformers), which is a pretraining technique that captures language semantics and exhibits state-of-the-art performance in various NLP tasks. \n",
    "Clinical-BERT is a domain-specific version of BERT. \n",
    "First, we need to install the biobert package. \n",
    "```bash\n",
    "pip install biobert-embedding \n",
    "```\n",
    "\n",
    "Worth to mention that installing biobert-embedding require torch<=1.2.0, which may be incompatible with current environment. So we suggest to open another conda environment to get the sentence embedding. \n",
    "Each sentence in eligibility criteria is converted into a 768-dimensional vector. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([768])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from biobert_embedding.embedding import BiobertEmbedding\n",
    "biobert = BiobertEmbedding()\n",
    "import warnings;warnings.filterwarnings(\"ignore\")\n",
    "sentence = \"Patients must have cognitive dysfunction on neuropsychological testing.\"\n",
    "embedding = biobert.sentence_vector(sentence)\n",
    "embedding.shape "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}