# AI-Based Clinical Trial Matching System
This project implements an advanced AI-powered system for matching patients with suitable clinical trials. By leveraging natural language processing, vector embeddings, and machine learning techniques, the system analyzes patient medical records and compares them against inclusion and exclusion criteria of active clinical trials to identify potential matches. This tool aims to streamline the clinical trial enrollment process, helping researchers, healthcare professionals, and patients find relevant trials more efficiently.
The system employs a multi-stage pipeline architecture:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │     │                 │
│ Data Ingestion  │────▶│ Data Processing │────▶│ Vector Embedding│────▶│ Matching Engine │
│                 │     │                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │                       │
         ▼                       ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │     │                 │
│ Clinical Trials │     │  Patient Data   │     │ ChromaDB Vector │     │  JSON Results   │
│   Web Scraper   │     │ SQLite Database │     │      Store      │     │                 │
│                 │     │                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
```
- `csv_to_db.py`
- `web_scraper_trials.py`
- `combine_patient_data.py`
- `summarize_apis/`
- `create_clinical_trial_embeddings.py`
- `find_matching_trial.py`

The matching process employs a 3-stage algorithm:
- Filters trials with scores above a defined threshold

Expert LLM Assessment:

- Provides human-readable explanations for match quality

Result Generation:
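
The threshold filter mentioned in the first stage could look roughly like the following. This is a minimal sketch, not the project's own code: the function name `filter_by_threshold`, the `(trial_id, score)` tuple shape, the trial IDs, and the `0.5` threshold are all assumptions for illustration. It also assumes that a higher score means a closer match; if the vector store returns distances instead, the comparison would need to be reversed.

```python
from typing import List, Tuple


def filter_by_threshold(
    scored_trials: List[Tuple[str, float]],
    threshold: float = 0.5,  # assumed value; the real threshold is project-specific
) -> List[Tuple[str, float]]:
    """Keep trials whose similarity score exceeds the threshold, best match first."""
    kept = [(trial_id, score) for trial_id, score in scored_trials if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical trial IDs and scores, for illustration only.
candidates = [("NCT-A", 0.82), ("NCT-B", 0.41), ("NCT-C", 0.67)]
shortlist = filter_by_threshold(candidates, threshold=0.5)
```

Only the trials in `shortlist` would then be passed on to the more expensive LLM assessment stage.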
Using pip:

```bash
python -m venv venv
source venv/bin/activate # On Linux or macOS
venv\Scripts\activate # On Windows
```

Using conda:

```bash
conda create -n myenv python=3.9
conda activate myenv
```
Using pip:

```bash
pip install -r requirements.txt
```

Using conda:

```bash
conda env create -f environment.yml
conda activate myenv
```
Add your Hugging Face API key to the `.env` file as `HUGGINGFACE_KEY={yourkey}`.
Alternatively, you can use OpenRouter by obtaining a key and setting `OPENROUTER_KEY={yourkey}`.
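
To illustrate how the two keys might be consumed, here is a minimal sketch that picks whichever one is set. It assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader; whether the project uses one is not stated here). The function name `pick_llm_credentials` and the preference for Hugging Face over OpenRouter are assumptions, not the project's documented behavior.

```python
import os
from typing import Mapping, Optional, Tuple


def pick_llm_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (provider, key), checking HUGGINGFACE_KEY first, then OPENROUTER_KEY."""
    env = os.environ if env is None else env
    for provider, variable in (
        ("huggingface", "HUGGINGFACE_KEY"),
        ("openrouter", "OPENROUTER_KEY"),
    ):
        key = env.get(variable, "").strip()
        if key:
            return provider, key
    # Neither variable is set: fail early with an actionable message.
    raise RuntimeError("Set HUGGINGFACE_KEY or OPENROUTER_KEY before running the pipeline")
```

Calling `pick_llm_credentials()` with no arguments reads from `os.environ`; passing a plain dictionary makes the function easy to test.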
Prepare Sample Data:

- Extract to the project root directory and rename to `patient_data`
Run the Main Script:

```bash
python main.py
```

This will:

- Save results as JSON files in `patient_trials_matched/`
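
Once the run finishes, the saved results can be inspected programmatically. The sketch below loads every JSON file from the output directory; the helper name `load_match_results` is hypothetical, and because the structure of each result file is not described here, the contents are returned as parsed without assuming any particular fields.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_match_results(results_dir: str = "patient_trials_matched") -> Dict[str, Any]:
    """Load every *.json file in results_dir, keyed by file name without extension."""
    results: Dict[str, Any] = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results
```

For example, `load_match_results()` returns a dictionary with one entry per result file, or an empty dictionary if the directory contains no JSON files yet.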
Adjust Processing Parameters (Optional):

- Edit `find_matching_trial.py` to change the number of patients processed
- Edit `web_scraper_trials.py` to adjust the number of trials scraped

LLM API Rate Limits:
The system relies on external LLM APIs which have rate limits. This restricts the number of patients and trials that can be processed in a given time period. Consider using paid API plans or implementing caching mechanisms for production use.
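
One possible form of the caching mentioned above is a small on-disk cache keyed by a hash of the prompt, so that repeated runs do not spend rate-limited calls on identical requests. This is a sketch under assumptions: `cached_llm_call`, the `.llm_cache` directory, and the `call_llm` callable (standing in for whatever function actually contacts the API) are all hypothetical names.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_llm_call(
    prompt: str,
    call_llm: Callable[[str], str],  # hypothetical function that hits the real API
    cache_dir: str = ".llm_cache",   # assumed cache location
) -> str:
    """Return the cached response for prompt, calling call_llm only on a cache miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    # Hash the prompt so the file name is short and filesystem-safe.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    entry = cache_path / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["response"]
    response = call_llm(prompt)
    entry.write_text(
        json.dumps({"prompt": prompt, "response": response}), encoding="utf-8"
    )
    return response
```

Because the cache key depends only on the prompt text, any change to the prompt template or to the model would call for clearing the cache directory.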
Context Window Constraints:
The Llama 3.2 3B-Instruct model has a context window of 4096 tokens, limiting the amount of patient data and trial information that can be processed simultaneously. This may affect the comprehensiveness of the analysis, particularly for complex medical histories or detailed trial criteria.
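
A simple way to stay within such a limit is to budget the prompt before sending it. The sketch below is illustrative only: it approximates tokens by whitespace-separated words (a real tokenizer would give different, usually higher, counts), and the function names, the 512 tokens reserved for the model's answer, and the even split between patient data and trial criteria are all assumptions.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated words (a rough token proxy)."""
    words = text.split()
    return " ".join(words[:max_tokens])


def build_prompt(
    patient_summary: str,
    trial_criteria: str,
    context_window: int = 4096,      # limit stated for the model as used here
    reserved_for_output: int = 512,  # assumed headroom for the model's reply
) -> str:
    """Split the remaining budget evenly between patient data and trial criteria."""
    per_part = (context_window - reserved_for_output) // 2
    patient_part = truncate_to_budget(patient_summary, per_part)
    trial_part = truncate_to_budget(trial_criteria, per_part)
    return f"Patient summary:\n{patient_part}\n\nTrial criteria:\n{trial_part}"
```

Truncation discards the tail of each input, so summarizing long records before this step would preserve more of the relevant information than cutting them off.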
LLM Output Variability:
Despite prompt engineering, LLM outputs can vary, affecting the consistency of matching results. This variability is inherent to current language models and can impact the reliability of the eligibility assessments.
Embedding Model Limitations:
The system uses a relatively small embedding model (all-MiniLM-L6-v2) which, while efficient, may not capture all the nuances of medical terminology and relationships compared to larger domain-specific models.
- Explore medical domain-specific fine-tuned models for improved understanding of clinical terminology

Advanced Embedding Techniques:

- Explore hybrid retrieval approaches combining sparse and dense embeddings

Refined Matching Algorithm:

- Develop a feedback loop to improve matching accuracy over time

System Robustness:

- Develop monitoring and logging for production deployment

User Interface:

- Develop export functionality to various formats (CSV, Excel, Google Sheets)

Domain-Specific Customization:
This project is licensed under the MIT License.