73 lines (72 with data), 8.0 kB
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 12 Questions to clinical trial eligibility\n",
"\n",
"This projects aims to simplify the daunting task of potential candidates finding clinical trials, by transforming eligibility criteria into a simple dynamic questionnaire. After collecting very basic data (e.g., age, gender-assigned-at-birth, location), the candidate will be asked a series of simple Yes/No questions. Each question in the series will aim to maximize the information gain at each step. Because we do not know the candidates a priori, the questions will be asked in a sequential manner, with the next question based on the answer to the previous question. At each step, the question that maximally narrows the number of trials the candidate is eligible for without losing trials that they may be eligible for will be asked. The questions will be ranked based on the information gain at each step (see **Step 2** below for details).\n",
"\n",
"## Problem Statement\n",
"While clinical trials are listed in detail on websites like ClinicalTrials.gov, with APIs to query and access clinical trial data posted, it remains a daunting task for candidates to find suitable trials among the ~100,000 trials recruiting or recruiting soon. In particular, while many components of the clinical trials are parsed into structured data, the eligibility criteria (the key component needed for candidates to assess their eligibility) are often written in long, unstructured text with a lot of medical jargon. This makes it difficult for candidates to quickly assess their eligibility for a trial through search filters provided by these websites.\n",
"\n",
"## Approach\n",
"### Step 1. Per-trial question set\n",
"I will use a dataset of clinical trial eligibility criteria, and use natural language processing (NLP) techniques to parse the text into structured data. Of particular importance will be to identify the key components of the eligibility criteria and to unify synonymous terms. Once the eligibility criteria are parsed into structured data, a particular clinical trail's eligibility criteria can be transformed into a set of simple Yes/No questions using LLMs. Questions across trials that are very similar will be standardized/synonymized to generalise across trials as much as possible so that the information gain at each step is maximized (e.g., \"Do you have a history of heart disease?\" vs. \"Have you been diagnosed with heart disease?\"). The trade-off here will be that aggressive generalization may lead to loss of information in terms of eligibility. The degree of question generalization will be optimized for the best balance between information gain per step and possible opportunity loss for the candidate.\n",
"\n",
"*The question set will exclude basic information such as age, gender-assigned-at-birth, and location, which will be collected at the outset.*\n",
"\n",
"### Step 2. Question ranking\n",
"The questions will be ranked based on the information gain at each step. Information gain in this context is defined as the question that is most likely to give information about the candidate's eligibility for the maximum number of trials. In other words, we don't want to ask questions that are too specific to a small number of trials, but rather questions that are general enough to exclude or include a large number of trials. So, calculating information gain will be based on the total number of trials the question is relevant to, as well as the symmetry of the split between trials that the candidate is eligible if they answer Yes/No. The symmetry of the split is important. For example, if most trials ask for candidates that do not have a history of heart disease, then asking \"Do you have a history of heart disease?\" will only gain a lot of information for those that have heart disease. Those without heart disease will be left with an insufficiently narrowed dataset. In the heart disease question question will have low overall information gain. \n",
"\n",
"The question that has the highest information gain as defined before, will be asked first. Once a question is answered, the top-ranked question will be asked next for the narrowed down set of trials. This process will continue until the candidate is left with a small number of trials that they can then explore in detail and sign up for.\n",
"\n",
"I hope that a sufficiently narrow set of trials with good fit can be achieved for most candidates in less than 12 questions.\n",
"\n",
"*The information gain will be calculated as the entropy of the dataset before the question minus the entropy of the dataset after the question. The entropy will be calculated as the sum of the negative log of the probability of each class in the dataset. The entropy will be calculated for the dataset before the question and after the question for both the Yes and No answers. The information gain will be calculated as the entropy before the question minus the weighted average of the entropy after the question for the Yes and No answers. The question with the highest information gain will be asked first.*\n",
"\n",
"#### Note\n",
"Trial questions, especially when standardized/grouped across trials, may not give absolute information about the candidate's eligibility on a per trial basis. It may be more useful/appropriate to think of questions as a way of determining the probability of eligibility for a trial. The goal is to maximize the probability of eligibility for the candidate across all trials, rather than to determine eligibility for a specific trial. The candidate will be able to see the list of trials they are eligible for at the end of the questionnaire and can then explore the details of each trial to determine if they are a good fit.\n",
"\n",
"### Step 3. User interface\n",
"The user interface will be a simple web form that asks the candidate the questions in a sequential manner. The candidate will be able to see the progress and the number of trials they are eligible for at each step. The candidate will be able to see the final list of trials they are eligible for and will be able to click on the trial to get more information and sign up.\n",
"\n",
"I will aim to keep this very simple at first, but will add more features as needed.\n",
"\n",
"Candidate data will contain no personally identifiable information (PII) and will be stored securely. The candidate will be able to opt-in to receive updates on new trials that they may be eligible for by providing an email address, but this will not be mandatory.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Aggregation\n",
"This notebook will focus on the first step of the approach, which is to parse the eligibility criteria of clinical trials into structured data. I will use the Clinical Trials API provided by clinicaltrials.gov to extract clinical trial data for trials that are currently recruiting.\n",
"\n",
"The data will be stored in a structured format in a database, and will be used to generate the question set for each trial. In the initial phase of the project, I will focus on a subset of trials to develop the question set and the ranking algorithm. Once the question set and ranking algorithm are developed, I will scale up to include all trials recruiting or recruiting soon, with regular updates to the database to include new trials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}