🧠 Breast Cancer Treatment Prediction

📌 Project Overview

This project predicts which type of treatment a breast cancer patient is likely to receive (chemotherapy, radiotherapy, or hormone therapy) from clinical and molecular features. The core goal is to explore whether treatment decisions can be anticipated from patient data and how well machine learning captures them. Such models could support personalized medicine and treatment planning, and could help flag cases that deviate from current medical practice.

📊 Dataset

The data comes from the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) study, a joint Canada-UK project published in Nature Communications (Pereira et al., 2016). The dataset was obtained via Kaggle and contains:

  • Clinical information (age, tumor size, menopausal state, etc.)
  • Molecular markers (ER/PR/HER2 status, mutation counts, etc.)
  • Treatment flags (whether the patient received chemo-, radio-, or hormone therapy)

🧪 Methodology

We approached the prediction task in two main modeling phases:

1. Binary Classification: Chemotherapy Prediction

  • Objective: Predict whether a patient receives chemotherapy (Yes/No).
  • Model: Feedforward Neural Network (PyTorch)
  • Evaluation: ROC AUC, Accuracy, Precision, Recall, F1-Score
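As a rough illustration of how the listed metrics are derived from predictions, the sketch below computes them in plain Python on a toy example. The helper names `binary_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy scores are assumptions made for illustration, not the project's evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = chemotherapy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. This pairwise form is O(n^2) and only meant for
    small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not METABRIC): true labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC AUC is computed from the ranking of the scores alone.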

2. Multiclass Classification (Simplified)

  • Objective: Predict which single therapy the patient receives: exactly one of chemo, radio, or hormone (mutually exclusive).
  • Classes:
      • 0: Chemotherapy only
      • 1: Radiotherapy only
      • 2: Hormone therapy only
  • Model: Multiclass Neural Network
  • Note: Patients who received combinations of treatments were excluded in this simplified scenario.
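The label construction described above could look roughly like the following sketch. The flag names `chemotherapy`, `radio_therapy`, and `hormone_therapy` are assumed column names, and `single_therapy_label` is a hypothetical helper. Dropping patients with no recorded therapy is also an assumption, since the text only states that combinations were excluded.

```python
# Assumed treatment flag names mapped to the class indices listed above.
LABELS = {"chemotherapy": 0, "radio_therapy": 1, "hormone_therapy": 2}


def single_therapy_label(row):
    """Return the class index if exactly one therapy flag is set, else None."""
    received = [name for name in LABELS if row.get(name) == 1]
    return LABELS[received[0]] if len(received) == 1 else None


# Toy rows, not real patient records.
patients = [
    {"chemotherapy": 1, "radio_therapy": 0, "hormone_therapy": 0},
    {"chemotherapy": 0, "radio_therapy": 1, "hormone_therapy": 0},
    {"chemotherapy": 0, "radio_therapy": 0, "hormone_therapy": 1},
    {"chemotherapy": 1, "radio_therapy": 1, "hormone_therapy": 0},  # combination
    {"chemotherapy": 0, "radio_therapy": 0, "hormone_therapy": 0},  # no therapy
]

labels = [single_therapy_label(p) for p in patients]
# Keep only patients with exactly one therapy.
kept = [(p, y) for p, y in zip(patients, labels) if y is not None]
print(labels)
```

Filtering this way shrinks the training set, so the class counts after filtering are worth checking before interpreting per-class scores.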

📈 Feature Selection and Importance

Before training, we performed a Random Forest-based feature importance analysis to determine which features most influenced the decision for chemotherapy.

๐Ÿ† Selected Important Features:

  • age_at_diagnosis
  • tumor_size
  • tumor_stage
  • lymph_nodes_examined_positive
  • nottingham_prognostic_index
  • cellularity
  • neoplasm_histologic_grade
  • inferred_menopausal_state
  • er_status_measured_by_ihc
  • pr_status
  • her2_status

These were used in all models to maintain consistency and reduce overfitting.
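A Random Forest importance ranking of this kind might be sketched as follows with scikit-learn. The data here is synthetic (random numbers with a target that depends almost only on the first column), the four column names are reused purely as labels, and the hyperparameters are assumptions. None of it reproduces the project's actual analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix; names are labels only.
feature_names = ["tumor_size", "age_at_diagnosis", "cellularity", "pr_status"]
X = rng.normal(size=(500, len(feature_names)))

# Synthetic target driven almost entirely by the first column.
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# n_estimators and random_state are illustrative choices.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```

To keep the evaluation unbiased, a ranking like this would normally be fitted on the training split only, so that the held-out data does not influence which features are kept.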

🛠 Model Architecture

Both models are simple fully connected feedforward neural networks, set up and trained with:

  • Two hidden layers
  • ReLU activations
  • Dropout (0.5) to prevent overfitting
  • Early stopping based on validation loss
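A network matching this description could be sketched in PyTorch as below. The class name `TherapyNet`, the hidden layer widths (64 and 32), and the input width of 11 (one numeric column per selected feature) are assumptions. Only the two hidden layers, ReLU activations, and dropout rate of 0.5 come from the description above.

```python
import torch
from torch import nn


class TherapyNet(nn.Module):
    """Feedforward net: two hidden layers, ReLU, dropout 0.5 (sizes assumed)."""

    def __init__(self, n_features=11, hidden_1=64, hidden_2=32, n_outputs=1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, hidden_1),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_1, hidden_2),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_2, n_outputs),  # raw logits, no final activation
        )

    def forward(self, x):
        return self.layers(x)


# One logit for the binary task (pairs with nn.BCEWithLogitsLoss),
# three logits for the multiclass task (pairs with nn.CrossEntropyLoss).
binary_model = TherapyNet(n_outputs=1)
multiclass_model = TherapyNet(n_outputs=3)

batch = torch.randn(8, 11)  # 8 synthetic patients, 11 features
binary_logits = binary_model(batch)
multiclass_logits = multiclass_model(batch)
print(binary_logits.shape, multiclass_logits.shape)
```

Returning logits and leaving the sigmoid or softmax to the loss function is a common PyTorch convention for numerical stability. Whether the original models did so is not stated.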

🚫 Handling Overfitting

We observed initial signs of overfitting (training loss ↓, validation loss ↑). To mitigate this, we applied:

  • Dropout regularization
  • Early stopping with patience = 10
  • Feature reduction to the most important predictors only
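Early stopping with a patience of 10 can be expressed as a small counter over validation losses. The sketch below is framework-independent. The class name `EarlyStopping`, the `min_delta` parameter, and the simulated loss curve are assumptions for illustration.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed minimum change that counts
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Simulated validation losses: improve for five epochs, then drift upward,
# mimicking the overfitting pattern described above.
val_losses = [1.0, 0.8, 0.7, 0.65, 0.6] + [0.6 + 0.01 * i for i in range(1, 30)]

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)
```

In a real training loop the model weights from `best_epoch` would typically be saved and restored after stopping, so that the final model is the one with the lowest validation loss.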

📊 Results

✅ Binary Model – Chemotherapy Prediction

Metric      Score
Accuracy    90.4%
Precision   80.0%
Recall      75.9%
F1 Score    77.9%
ROC AUC     95.5%

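➡️ The model separates patients who receive chemotherapy from those who do not well (ROC AUC 95.5%), although a recall of 75.9% means that roughly one in four chemotherapy patients is missed.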


Multiclass Model – Therapy Type (Simplified)

Class  Therapy Type  Precision  Recall  F1 Score
0      Chemotherapy  1.00       0.80    0.89
1      Radiotherapy  0.82       0.71    0.76
2      Hormone       0.76       0.86    0.80

  • Overall Accuracy: 79.1%
  • Macro F1 Score: 81.8%
  • Weighted F1 Score: 78.9%
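As a consistency check on the figures above, per-class F1 can be recomputed from the rounded precision and recall values, and macro F1 is their unweighted mean. The sketch below does this in plain Python. Small differences from the reported values are expected because the inputs are already rounded to two decimals. The weighted F1 cannot be reproduced here because the per-class patient counts are not reported.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded per-class precision and recall taken from the results table.
per_class = {
    "Chemotherapy": (1.00, 0.80),
    "Radiotherapy": (0.82, 0.71),
    "Hormone": (0.76, 0.86),
}

f1_by_class = {name: f1_score(p, r) for name, (p, r) in per_class.items()}

# Macro F1 weights every class equally, regardless of how many patients it has.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for name, value in f1_by_class.items():
    print(f"{name:13s} F1 = {value:.3f}")
print(f"Macro F1 = {macro_f1:.3f}")
```

Because the weighted F1 (78.9%) is lower than the macro F1 (81.8%), the classes with lower F1 scores must carry more of the test patients than the chemotherapy-only class.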

➡️ Reasonable class separation across all three therapy types: chemotherapy-only is the easiest class to identify (F1 0.89), while radiotherapy-only is the hardest (recall 0.71, F1 0.76).

🧠 Interpretation

  • The binary model shows that the decision to give chemotherapy is highly predictable from the selected features (ROC AUC 95.5%).
  • The multiclass model suggests that the therapy type can be predicted with nearly 80% accuracy, under the simplifying assumption that each patient receives exactly one therapy.
  • Strong predictors include tumor size, nodal involvement, hormone receptor status, and the Nottingham Prognostic Index, all of which align with clinical reasoning.

📌 Future Work

  • Include patients with combined therapies and model them hierarchically
  • Explore explainable AI (e.g., SHAP values) for clinical transparency
  • Deploy model in a lightweight clinical dashboard for testing

📚 References