Medical Trial Classification System

Overview

The Medical Trial Classification System is an automated machine learning solution that classifies medical trial descriptions into five disease categories. The system is only partially implemented; its goal is to reduce the manual effort of categorizing medical trials.

Current Implementation Status

✅ Core preprocessing pipeline
✅ Basic model implementation
✅ Initial API setup
✅ Basic testing framework
❌ Complete unit test coverage
❌ Advanced preprocessing features
❌ Model optimization
❌ Full system integration testing

Disease Categories

Amyotrophic Lateral Sclerosis (ALS)
Obsessive Compulsive Disorder (OCD)
Parkinson's Disease
Dementia
Scoliosis

Project Structure

root/
├── data/        # Data storage and processing
├── docs/        # Project documentation
├── logs/        # Application logs
├── notebooks/   # Analysis notebooks
├── scripts/     # Utility scripts
├── src/         # Source code
└── tests/       # Test files

Key Components

src/preprocessing/: Text preprocessing pipeline
src/models/: Model implementation and training
src/data/: Data processing and pipeline
src/utils/: Utility functions and logging
tests/: Test implementations

Installation

Clone the repository:

git clone https://github.com/fesarikaya/MedicalTrialClassification
cd MedicalTrialClassification

Run the environment setup:

python environment_setup.py

Requirements

Python 3.8+
8GB+ RAM recommended
Disk space for model storage
Internet connection for package installation

Key Dependencies

Flask==3.0.2
pandas==2.2.0
scikit-learn==1.4.0
nltk==3.8.1
spacy==3.7.2
pytest==8.0.0

Full dependencies are listed in requirements.txt.

Current Performance

Model Performance

Best performer: Bagging Classifier
Accuracy: 50.0%
F1 Score: 0.490
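
For readers who want to reproduce figures of this kind, here is a minimal sketch of how accuracy and F1 can be computed for a multi-class classifier. This README does not state how the F1 score was averaged, so macro-averaging (the unweighted mean of per-class F1) is an assumption here. The function names and the toy labels are illustrative only and are not taken from the project code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, see lead-in)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration; not project data.
y_true = ["ALS", "OCD", "Dementia", "Scoliosis"]
y_pred = ["ALS", "OCD", "Scoliosis", "Scoliosis"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

If the project used a different averaging mode (for example weighted by class frequency), the F1 value would differ from what this sketch computes.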

Known Issues

Preprocessing Pipeline
- Performance issues in current implementation
- Medical term standardization needs improvement
- Special character handling requires optimization

Model Performance
- Lower than target accuracy due to preprocessing issues
- Feature engineering needs enhancement
- Model tuning incomplete
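
To make the preprocessing issues above more concrete, here is a hypothetical sketch of what special character handling and medical term standardization could look like. It is not the code in src/preprocessing/. The `normalize` function and the `ABBREVIATIONS` map are assumptions introduced for illustration, and the map contains only the two abbreviations named in this README.

```python
import re
import unicodedata

# Hypothetical abbreviation map (assumption): only the abbreviations that
# appear in this README. A real pipeline would need a curated vocabulary.
ABBREVIATIONS = {
    "als": "amyotrophic lateral sclerosis",
    "ocd": "obsessive compulsive disorder",
}


def normalize(text):
    """Lowercase, strip special characters, and expand known abbreviations."""
    # Decompose accented characters, then drop anything that is not ASCII.
    # Note: this also drops symbols such as Greek letters, which may matter
    # for medical text.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace everything except letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Expand abbreviations token by token; split() also collapses whitespace.
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(normalize("OCD: café-based  study!"))
```

Compiling the regular expression once and avoiding repeated passes over the text are the usual first steps when a pipeline like this is slow, though whether that applies to the current implementation is not something this README confirms.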

Usage

API Endpoints

Prediction Endpoint:

POST /predict
Content-Type: application/json
{
    "description": "Medical trial description text"
}

Health Check:

GET /health
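
The two endpoints above can be called from Python using only the standard library. This is a minimal client sketch under stated assumptions: the base URL `http://localhost:5000` is Flask's development-server default and is not confirmed by this README, and the helper names (`build_predict_request`, `build_health_request`, `send`) are hypothetical. The response format is not documented here, so the sketch returns the raw response text instead of parsing specific fields.

```python
import json
import urllib.request

# Assumption: Flask's default development host and port.
BASE_URL = "http://localhost:5000"


def build_predict_request(description, base_url=BASE_URL):
    """Build (but do not send) the POST /predict request."""
    body = json.dumps({"description": description}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_health_request(base_url=BASE_URL):
    """Build (but do not send) the GET /health request."""
    return urllib.request.Request(url=f"{base_url}/health", method="GET")


def send(request, timeout=10):
    """Send a request; return the HTTP status code and raw response text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

With the API running, `send(build_health_request())` checks that the service is up, and `send(build_predict_request("Medical trial description text"))` submits a description for classification.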

Testing

Basic tests are implemented in the tests/ directory:
- API_test.py: API endpoint testing
- model_evaluation_test.py: Basic model evaluation

The latest test results are available in prediction_test_results.json.

Future Work

Preprocessing Enhancements
- Optimize medical term handling
- Improve text normalization
- Enhance special character processing

Model Optimization
- Implement advanced feature engineering
- Optimize model parameters
- Enhance ensemble methods

Testing Completion
- Implement comprehensive unit tests
- Add integration tests
- Complete performance testing

Important Notes

System is currently in partial implementation status
Use with caution and verify all predictions
Current accuracy is limited
Future updates will address known issues

Development Status

The project is currently incomplete due to deadline constraints. Key pending items include:
- Complete unit test coverage
- Advanced preprocessing features
- Model optimization
- Full system integration testing

Warning

⚠️ This system is currently in partial implementation status with known preprocessing issues affecting model performance. Use as an assistance tool only and verify all predictions manually.