# AI-Based Clinical Trial Matching System

## Project Overview

This project implements an AI-powered system for matching patients with suitable clinical trials. Using natural language processing, vector embeddings, and machine learning techniques, the system analyzes patient medical records and compares them against the inclusion and exclusion criteria of active clinical trials to identify potential matches. The goal is to streamline clinical trial enrollment by helping researchers, healthcare professionals, and patients find relevant trials more efficiently.

## Core Architecture

The system employs a multi-stage pipeline architecture:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │     │                 │
│  Data Ingestion │────▶│  Data Processing│────▶│ Vector Embedding│────▶│ Matching Engine │
│                 │     │                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │                       │
        ▼                       ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │     │                 │
│ Clinical Trials │     │  Patient Data   │     │ ChromaDB Vector │     │ JSON Results    │
│    Web Scraper  │     │  SQLite Database│     │     Store       │     │                 │
│                 │     │                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘

## Key Components

### 1. Data Ingestion & Storage

#### Patient Data Processing (`csv_to_db.py`)

- Converts patient CSV files into a structured SQLite relational database
- Cleans and normalizes column names for consistency
- Creates tables for different types of patient data (allergies, conditions, medications, etc.)
- Provides a queryable foundation for patient information
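
A minimal sketch of the CSV-to-SQLite step follows. It uses only the standard library (`csv`, `sqlite3`) rather than the Pandas and SQLAlchemy stack this project lists, and the helper names `normalize_column` and `load_csv`, the sample row, and the `conditions` table name are illustrative assumptions, not the contents of `csv_to_db.py`.

```python
import csv
import io
import re
import sqlite3


def normalize_column(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def load_csv(conn: sqlite3.Connection, table: str, csv_file) -> list:
    """Create `table` from an open CSV file and return its normalized column names."""
    reader = csv.reader(csv_file)
    columns = [normalize_column(header) for header in next(reader)]
    column_sql = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    # The reader now yields only the data rows, one list of strings per row.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return columns


# Hypothetical sample standing in for one of the patient CSV files.
sample = io.StringIO("Patient ID,Start Date,Description\np1,2020-01-05,Hypertension\n")
conn = sqlite3.connect(":memory:")
columns = load_csv(conn, "conditions", sample)
rows = conn.execute('SELECT patient_id, description FROM "conditions"').fetchall()
```

Storing every column as `TEXT` keeps the sketch short; a real loader would likely infer or declare column types.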

#### Clinical Trial Web Scraper (`web_scraper_trials.py`)

- Asynchronously scrapes clinical trial data from clinicaltrials.gov
- Extracts NCT IDs, titles, inclusion criteria, and exclusion criteria
- Implements pagination handling, retry mechanisms, and progress tracking
- Stores processed trial data in ChromaDB for vector similarity searching
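
The retry mechanism and NCT ID extraction can be illustrated with a small `asyncio` sketch. It does not use the project's crawler: `flaky_fetch` is a hypothetical stand-in for a real page fetch, and `fetch_with_retry`, `extract_nct_ids`, the retry count, and the backoff delays are assumptions chosen for illustration.

```python
import asyncio
import re

# NCT IDs are the letters "NCT" followed by eight digits.
NCT_ID_PATTERN = re.compile(r"\bNCT\d{8}\b")


def extract_nct_ids(text: str) -> list:
    """Return unique NCT IDs in order of first appearance."""
    return list(dict.fromkeys(NCT_ID_PATTERN.findall(text)))


async def fetch_with_retry(fetch, url: str, retries: int = 3, base_delay: float = 0.01) -> str:
    """Await `fetch(url)`, retrying failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


# Hypothetical fetcher that fails once, then returns a page body.
calls = {"count": 0}


async def flaky_fetch(url: str) -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary failure")
    return "Trials: NCT01234567, NCT07654321 and again NCT01234567"


page = asyncio.run(fetch_with_retry(flaky_fetch, "https://clinicaltrials.gov/"))
nct_ids = extract_nct_ids(page)
```

Deduplicating with `dict.fromkeys` preserves first-seen order, which is convenient when the same trial appears on more than one results page.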

### 2. Data Processing & Summarization

#### Patient Profile Creation (`combine_patient_data.py`)

- Consolidates patient information from multiple database tables
- Creates comprehensive patient profiles with relevant medical history
- Calculates patient age and formats data for LLM processing
- Provides a unified view of each patient's health status
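
As a rough illustration of the age calculation and profile formatting, the sketch below builds a plain-text profile from a dictionary. The field names (`birth_date`, `conditions`, `medications`) and the output layout are assumptions for the example; the actual schema and prompt format in `combine_patient_data.py` may differ.

```python
from datetime import date


def calculate_age(birth_date: date, today: date) -> int:
    """Age in whole years, one less if this year's birthday has not happened yet."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def format_profile(patient: dict, today: date) -> str:
    """Render a patient record as short labeled lines suitable for an LLM prompt."""
    lines = [
        f"Age: {calculate_age(patient['birth_date'], today)}",
        f"Conditions: {', '.join(patient['conditions']) or 'none recorded'}",
        f"Medications: {', '.join(patient['medications']) or 'none recorded'}",
    ]
    return "\n".join(lines)


# Hypothetical patient record assembled from several database tables.
patient = {
    "birth_date": date(1980, 6, 15),
    "conditions": ["Hypertension"],
    "medications": [],
}
profile = format_profile(patient, today=date(2024, 6, 14))
```

Passing `today` explicitly keeps the function deterministic and easy to test.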

#### LLM Summarization Pipeline (`summarize_apis/`)

- Connects to language model APIs (Hugging Face, OpenRouter)
- Summarizes complex patient data into concise, structured profiles
- Processes clinical trial criteria for better matching
- Supports multiple LLM providers with a consistent interface
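
One way to give several providers a consistent interface is a small protocol that each provider client implements. The sketch below is an assumption about how that could look, not the layout of `summarize_apis/`: `Summarizer`, `EchoSummarizer`, and `summarize_all` are hypothetical names, and `EchoSummarizer` is an offline stand-in that makes no API call.

```python
from typing import Protocol


class Summarizer(Protocol):
    """Anything with a `summarize(text) -> str` method counts as a provider."""

    def summarize(self, text: str) -> str: ...


class EchoSummarizer:
    """Offline stand-in for a real provider client: keeps the first `max_words` words."""

    def __init__(self, max_words: int = 5) -> None:
        self.max_words = max_words

    def summarize(self, text: str) -> str:
        return " ".join(text.split()[: self.max_words])


def summarize_all(summarizer: Summarizer, documents: list) -> list:
    """Run every document through whichever provider was passed in."""
    return [summarizer.summarize(doc) for doc in documents]


summaries = summarize_all(
    EchoSummarizer(max_words=3),
    ["Patient has type 2 diabetes and hypertension"],
)
```

A Hugging Face or OpenRouter client would expose the same `summarize` method, so the calling code would not need to change when the provider does.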

### 3. Vector Embedding & Storage

#### Embedding Generation (`create_clinical_trial_embeddings.py`)

- Creates vector embeddings for patient profiles and clinical trial criteria
- Uses SentenceTransformer models for semantic representation
- Maintains separate collections for inclusion and exclusion criteria
- Enables efficient similarity searching and comparison

### 4. Matching Algorithm (`find_matching_trial.py`)

The matching process runs in three stages:

1. Vector Similarity Search:
   - Embeds patient profiles using SentenceTransformer
   - Calculates cosine similarity between patient embeddings and trial criteria embeddings
   - Scores trials based on similarity to inclusion criteria and dissimilarity to exclusion criteria
   - Keeps only trials whose scores exceed a defined threshold
2. Expert LLM Assessment:
   - For top-scoring trials, performs detailed eligibility analysis using an LLM
   - Evaluates patient data against specific inclusion/exclusion criteria
   - Generates eligibility scores and detailed reasoning
   - Provides human-readable explanations for match quality
3. Result Generation:
   - Compiles matching trials and their assessments into structured JSON format
   - Includes trial IDs, names, and detailed eligibility criteria matches
   - Saves results for each patient for further analysis or integration
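
The first and last stages can be sketched in plain Python. The exact scoring formula is not stated in this README, so the sketch assumes a simple difference: cosine similarity to the inclusion embedding minus cosine similarity to the exclusion embedding. The threshold of 0.2, the toy two-dimensional vectors, and the JSON field names are likewise illustrative assumptions, and the LLM assessment stage is omitted.

```python
import json
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_trial(patient_vec, inclusion_vec, exclusion_vec) -> float:
    """Assumed score: reward inclusion similarity, penalize exclusion similarity."""
    return cosine_similarity(patient_vec, inclusion_vec) - cosine_similarity(patient_vec, exclusion_vec)


def shortlist(patient_vec, trials, threshold: float = 0.2) -> list:
    """Return (nct_id, score) pairs above the threshold, best first."""
    scored = [(t["nct_id"], score_trial(patient_vec, t["inclusion"], t["exclusion"])) for t in trials]
    kept = [(nct_id, score) for nct_id, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real embeddings would come from the SentenceTransformer model.
patient_vec = [1.0, 0.0]
trials = [
    {"nct_id": "NCT00000001", "inclusion": [1.0, 0.0], "exclusion": [0.0, 1.0]},
    {"nct_id": "NCT00000002", "inclusion": [0.0, 1.0], "exclusion": [1.0, 0.0]},
]
matches = shortlist(patient_vec, trials)
result_json = json.dumps(
    {"matches": [{"trial_id": nct_id, "score": round(score, 3)} for nct_id, score in matches]}
)
```

In this toy case the first trial matches the patient on inclusion and not on exclusion, so it is kept, while the second trial is the reverse and falls below the threshold.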

## Technical Implementation Details

### Data Flow

1. Patient data from CSV files is processed and stored in a SQLite database
2. Clinical trial data is scraped from clinicaltrials.gov and stored in ChromaDB
3. Patient profiles are created by querying the SQLite database
4. LLM summarizes patient profiles and trial criteria
5. Vector embeddings are generated for patient profiles and trial criteria
6. The matching algorithm identifies suitable trials for each patient
7. Results are saved as JSON files

### Key Technologies

- Database: SQLite with SQLAlchemy ORM
- Web Scraping: AsyncWebCrawler with retry mechanisms
- Vector Database: ChromaDB for efficient similarity searching
- Embeddings: SentenceTransformer (`all-MiniLM-L6-v2`)
- Language Models: Llama 3.2 3B-Instruct via Hugging Face/OpenRouter APIs
- Data Processing: Pandas for CSV handling and data manipulation
- Asynchronous Processing: Python asyncio for concurrent operations

## Features

- Web Scraper: Fetches the latest ongoing clinical trials from clinicaltrials.gov and stores them as vector embeddings in ChromaDB
- Patient Data Preprocessing: Converts CSV files into a structured SQLite database for efficient querying
- LLM Pipeline for Summarization: Connects to local or online LLM APIs to summarize patient data and trial criteria
- Matching Algorithm: Implements a 3-stage matching process: vector similarity search with threshold filtering, LLM eligibility assessment, and result generation
- Documentation: Provides comprehensive documentation of the system architecture and components
- JSON File Output: Generates structured output files containing matching trials and eligibility assessments
- Unit and Integration Tests: Test suite for ensuring reliability and accuracy
- Google Sheet Output: Export functionality for collaborative review

## Setting Up the Environment

Create a virtual environment:

Using pip:

    python -m venv venv
    source venv/bin/activate  # On Linux or macOS
    venv\Scripts\activate     # On Windows

Using conda:

    conda create -n myenv python=3.9
    conda activate myenv

Install dependencies:

Using pip:

    pip install -r requirements.txt

Using conda:

    conda env create -f environment.yml
    conda activate myenv

## Running the System

1. Obtain API keys:
   - Get a Hugging Face API key from huggingface.co
   - Store it in a `.env` file as `HUGGINGFACE_KEY={yourkey}`
   - Alternatively, you can use OpenRouter by obtaining a key and setting `OPENROUTER_KEY={yourkey}`
2. Prepare sample data:
   - Download sample patient data from here
   - Extract to the project root directory and rename to `patient_data`
3. Run the main script with `python main.py`. This will:
   - Set up the patient SQLite database
   - Scrape clinical trials and store them in ChromaDB
   - Run the matching algorithm
   - Save results as JSON files in `patient_trials_matched/`
4. Adjust processing parameters (optional):
   - Modify `find_matching_trial.py` to change the number of patients processed
   - Update `web_scraper_trials.py` to adjust the number of trials scraped
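
For orientation, here is one way the two API key variables could be resolved at startup. This is a sketch under assumptions, not the project's code: `select_provider` is a hypothetical helper, the preference for Hugging Face over OpenRouter when both keys are set is an arbitrary choice, and loading the `.env` file into the process environment (for example with a package such as python-dotenv) is assumed to happen before this runs.

```python
import os


def select_provider(env=os.environ):
    """Return (provider, key) for the first API key that is set and non-empty."""
    candidates = (
        ("huggingface", "HUGGINGFACE_KEY"),
        ("openrouter", "OPENROUTER_KEY"),
    )
    for provider, variable in candidates:
        key = env.get(variable)
        if key:
            return provider, key
    raise RuntimeError("Set HUGGINGFACE_KEY or OPENROUTER_KEY before running main.py")


# A plain dict stands in for the environment so the example has no side effects.
provider, key = select_provider({"OPENROUTER_KEY": "example-key"})
```

Failing early with a clear message avoids scraping trials and building the database only to stop at the first LLM call.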

## Technical Limitations

- LLM API Rate Limits: The system relies on external LLM APIs, which have rate limits. This restricts the number of patients and trials that can be processed in a given time period. Consider using paid API plans or implementing caching mechanisms for production use.
- Context Window Constraints: The Llama 3.2 3B-Instruct model is used here with a context window of 4096 tokens (the model itself supports longer contexts, so this limit likely comes from the hosted API), which limits the amount of patient data and trial information that can be processed at once. This may reduce the comprehensiveness of the analysis, particularly for complex medical histories or detailed trial criteria.
- LLM Output Variability: Despite prompt engineering, LLM outputs can vary between runs, which affects the consistency of matching results. This variability is inherent to current language models and can reduce the reliability of the eligibility assessments.
- Embedding Model Limitations: The system uses a relatively small embedding model (`all-MiniLM-L6-v2`) which, while efficient, may not capture all the nuances of medical terminology and relationships as well as larger domain-specific models.
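
The caching idea mentioned under rate limits could look something like the sketch below, which stores each response under a hash of its prompt so that repeated prompts cost one API call. `CachedLLM` and `fake_model` are hypothetical names, the cache lives only in memory, and nothing here reflects how the project currently calls its LLM APIs.

```python
import hashlib


class CachedLLM:
    """Wrap a `call_model(prompt) -> str` function with an in-memory response cache."""

    def __init__(self, call_model):
        self._call_model = call_model
        self._cache = {}
        self.api_calls = 0  # counts only calls that reached the underlying model

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]


def fake_model(prompt: str) -> str:
    """Stand-in for a rate-limited API call."""
    return f"summary of {len(prompt)} characters"


llm = CachedLLM(fake_model)
first = llm.complete("Summarize the inclusion criteria")
second = llm.complete("Summarize the inclusion criteria")
```

Because a cached prompt always returns the same text, this also removes run-to-run variation for inputs that have already been seen, though not for new ones.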

## Future Improvements

- Enhanced LLM Integration:
  - Implement larger models like GPT-4, Claude 3 Opus, or a 70B Llama model (such as Llama 3.1 70B) for more accurate analysis
  - Explore medical domain-specific fine-tuned models for improved understanding of clinical terminology
- Advanced Embedding Techniques:
  - Implement keyword extraction before embedding generation
  - Use larger, medical domain-specific embedding models
  - Explore hybrid retrieval approaches combining sparse and dense embeddings
- Refined Matching Algorithm:
  - Optimize the weighting between inclusion and exclusion criteria similarity
  - Implement more sophisticated scoring mechanisms that account for the importance of different criteria
  - Develop a feedback loop to improve matching accuracy over time
- System Robustness:
  - Add comprehensive test suite for all components
  - Implement caching and rate-limiting strategies for API calls
  - Develop monitoring and logging for production deployment
- User Interface:
  - Create a web interface for easier interaction with the system
  - Implement visualization tools for match results
  - Develop export functionality to various formats (CSV, Excel, Google Sheets)
- Domain-Specific Customization:
  - Fine-tune models on medical literature and clinical trial data
  - Implement specialized processing for different medical specialties
  - Develop custom prompts for different types of clinical trials

## License

This project is licensed under the MIT License.