This project aims to predict which type of treatment a breast cancer patient is likely to receive (specifically chemotherapy, radiotherapy, or hormone therapy) using clinical and molecular features. The core goal is to explore whether treatment decisions can be anticipated from patient data and how well machine learning can capture such decisions. This can aid in personalized medicine, treatment planning, and potentially identifying outliers in current medical practice.
The data comes from the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) study, a joint Canada-UK project published in Nature Communications (Pereira et al., 2016). The dataset was obtained via Kaggle and contains:
We approached the prediction task in two main modeling phases:
- `0`: Chemotherapy only
- `1`: Radiotherapy only
- `2`: Hormone therapy only

Before training, we performed a Random Forest-based feature importance analysis to determine which features most influenced the decision for chemotherapy.

- `age_at_diagnosis`
- `tumor_size`
- `tumor_stage`
- `lymph_nodes_examined_positive`
- `nottingham_prognostic_index`
- `cellularity`
- `neoplasm_histologic_grade`
- `inferred_menopausal_state`
- `er_status_measured_by_ihc`
- `pr_status`
- `her2_status`

These were used in all models to maintain consistency and reduce overfitting.
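
A Random Forest importance ranking of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `RandomForestClassifier`, the helper `rank_features` is a hypothetical name, and the data is synthetic noise with a toy label, so the printed ranking says nothing about METABRIC itself. The hyperparameters (`n_estimators=200`, `random_state=0`) are placeholders, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "age_at_diagnosis", "tumor_size", "tumor_stage",
    "lymph_nodes_examined_positive", "nottingham_prognostic_index",
    "cellularity", "neoplasm_histologic_grade", "inferred_menopausal_state",
    "er_status_measured_by_ihc", "pr_status", "her2_status",
]


def rank_features(X, y, names, seed=0):
    """Fit a random forest and return (name, importance) pairs, highest first."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]


# Synthetic stand-in data (assumption): random columns, NOT real patient data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(FEATURES)))
# Toy binary label that depends only on the first two columns.
y = (X[:, 0] + 2.0 * X[:, 1] > 0).astype(int)

ranking = rank_features(X, y, FEATURES)
for name, score in ranking:
    print(f"{name:32s} {score:.3f}")
```

On the real data, `X` would hold the encoded clinical columns and `y` the binary chemotherapy indicator. Impurity-based importances tend to favor continuous and high-cardinality features, so the ranking is best read as a rough guide rather than a definitive ordering.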
Both models were trained using a simple fully connected feedforward neural network with:
We observed initial signs of overfitting (training loss ↓, validation loss ↑). To mitigate this, we applied:
| Metric | Score |
|---|---|
| Accuracy | 90.4% |
| Precision | 80.0% |
| Recall | 75.9% |
| F1 Score | 77.9% |
| ROC AUC | 95.5% |
➡️ High-performing model that reliably distinguishes patients who receive chemotherapy.
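
For readers who want to see how the threshold-based metrics relate to each other, here is a minimal pure-Python sketch. The helpers `binary_metrics` and `f1_from` are hypothetical names, and the label vectors are made-up toy values rather than model output. ROC AUC is left out because it needs predicted scores, not hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels (assumption): 1 = received chemotherapy, 0 = did not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy = binary_metrics(y_true, y_pred)
print(toy)

# Consistency check on the reported scores: precision 80.0%, recall 75.9%.
reported_f1 = f1_from(0.800, 0.759)
print(round(reported_f1, 3))
```

Plugging the reported precision and recall into `f1_from` reproduces the reported F1 of 77.9% to three decimals, so the table is internally consistent.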
| Class | Therapy Type | Precision | Recall | F1 Score |
|---|---|---|---|---|
| 0 | Chemotherapy | 1.00 | 0.80 | 0.89 |
| 1 | Radiotherapy | 0.82 | 0.71 | 0.76 |
| 2 | Hormone | 0.76 | 0.86 | 0.80 |
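➡️ Good class separation across all three therapy types: F1 ranges from 0.76 (radiotherapy) to 0.89 (chemotherapy). Chemotherapy predictions are the most precise (1.00), while radiotherapy has the lowest recall (0.71).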
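
Per-class scores like those in the table are typically read off a confusion matrix. The sketch below shows the mechanics under stated assumptions: `per_class_metrics` is a hypothetical helper, and the 3x3 matrix is invented for illustration, not the project's actual confusion matrix. The last lines compute the unweighted (macro) averages of the reported table values; support-weighted averages cannot be derived because class counts are not given.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples with true class i predicted as j."""
    n = len(confusion)
    rows = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall, "f1": f1})
    return rows


# Invented confusion matrix (assumption); rows are true classes 0, 1, 2.
toy_confusion = [
    [8, 1, 1],
    [0, 7, 3],
    [0, 1, 9],
]
toy_rows = per_class_metrics(toy_confusion)
for label, row in enumerate(toy_rows):
    print(label, {name: round(value, 2) for name, value in row.items()})

# Macro averages of the scores reported in the table above.
reported = {
    "precision": [1.00, 0.82, 0.76],
    "recall": [0.80, 0.71, 0.86],
    "f1": [0.89, 0.76, 0.80],
}
macro = {name: sum(values) / len(values) for name, values in reported.items()}
print({name: round(value, 3) for name, value in macro.items()})
```

The macro averages give a single-number summary that treats the three therapy classes equally, which is useful when class sizes differ.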