# 🧠 Breast Cancer Treatment Prediction

## 📌 Project Overview

This project predicts which type of treatment a breast cancer patient is likely to receive (**chemotherapy**, **radiotherapy**, or **hormone therapy**) from clinical and molecular features. The core goal is to explore whether treatment decisions can be anticipated from patient data and how well machine learning captures them. This can aid **personalized medicine** and **treatment planning**, and may help **identify outliers** in current medical practice.

## 📊 Dataset

The data comes from the **METABRIC (Molecular Taxonomy of Breast Cancer International Consortium)** study, a joint Canada-UK project published in Nature Communications (Pereira et al., 2016). The dataset was obtained via [Kaggle](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric) and contains:

- Clinical information (age, tumor size, menopausal state, etc.)
- Molecular markers (ER/PR/HER2 status, mutation counts, etc.)
- Treatment flags (whether the patient received chemo-, radio-, or hormone therapy)

## 🧪 Methodology

We approached the prediction task in **two main modeling phases**:

### 1. **Binary Classification: Chemotherapy Prediction**

- **Objective**: Predict whether a patient receives chemotherapy (Yes/No).
- **Model**: Feedforward Neural Network (PyTorch)
- **Evaluation**: ROC AUC, Accuracy, Precision, Recall, F1-Score

### 2. **Multiclass Classification (Simplified)**

- **Objective**: Predict which *single* therapy the patient receives: exactly one of chemotherapy, radiotherapy, or hormone therapy (mutually exclusive).
- **Classes**:
  - `0`: Chemotherapy only
  - `1`: Radiotherapy only
  - `2`: Hormone therapy only
- **Model**: Multiclass Neural Network
- **Note**: Patients who received combinations of treatments were excluded in this simplified scenario.
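
As a minimal sketch of how the three treatment flags could be turned into the class ids listed above, the helper below maps a patient's 0/1 flags to `0`, `1`, or `2` and returns `None` for patients who fall outside the simplified scenario. The function name `single_therapy_label` and the choice to also drop patients with no recorded therapy are assumptions for illustration, not details taken from the project code.

```python
from typing import Optional


def single_therapy_label(chemo: int, radio: int, hormone: int) -> Optional[int]:
    """Map three 0/1 treatment flags to a class id, or None if excluded.

    0 = chemotherapy only, 1 = radiotherapy only, 2 = hormone therapy only.
    """
    flags = (chemo, radio, hormone)
    if sum(flags) != 1:
        # Combination of therapies (or, by assumption, no therapy): excluded.
        return None
    return flags.index(1)


# Toy patients as (chemo, radio, hormone) flags.
patients = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 0, 0)]
labels = [single_therapy_label(*p) for p in patients]
print(labels)  # [0, 1, 2, None, None]
```

Rows labeled `None` would then be filtered out before training the multiclass model.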

## 📈 Feature Selection and Importance

Before training, we performed a **Random Forest-based feature importance analysis** to determine which features most influenced the decision for chemotherapy.

### 🏆 Selected Important Features:

- `age_at_diagnosis`
- `tumor_size`
- `tumor_stage`
- `lymph_nodes_examined_positive`
- `nottingham_prognostic_index`
- `cellularity`
- `neoplasm_histologic_grade`
- `inferred_menopausal_state`
- `er_status_measured_by_ihc`
- `pr_status`
- `her2_status`

These were used in all models to maintain consistency and reduce overfitting.
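
A Random Forest importance ranking of this kind can be sketched with scikit-learn as follows. The data here is synthetic and the toy label depends only on the first column, so the example is self-contained. The README does not say which importance measure was used; the impurity-based `feature_importances_` attribute shown here is one common choice and is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the clinical table: 500 patients, 4 columns.
# The last two column names are hypothetical noise features.
feature_names = ["tumor_size", "age_at_diagnosis", "noise_a", "noise_b"]
X = rng.normal(size=(500, 4))
# Toy target: "received chemotherapy" depends only on the first column.
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = forest.feature_importances_
ranking = sorted(
    zip(feature_names, importances), key=lambda item: item[1], reverse=True
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On the real dataset, the top-ranked columns from such a ranking would be the candidates for the selected feature list.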

## 🛠 Model Architecture

Both models use a simple fully connected feedforward neural network and were trained with:

- Two hidden layers
- ReLU activations
- Dropout (0.5) to prevent overfitting
- Early stopping based on validation loss
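
The architecture described above could look roughly like the PyTorch module below. Only the two hidden layers, ReLU activations, and dropout of 0.5 come from this README; the class name `TherapyNet`, the hidden sizes (64 and 32), and the assumption of 11 numeric input columns (one per selected feature) are illustrative.

```python
import torch
from torch import nn


class TherapyNet(nn.Module):
    """Feedforward net with two hidden layers, ReLU, and dropout (0.5)."""

    def __init__(self, n_features: int, n_outputs: int, hidden: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden // 2, n_outputs),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns raw logits; pair with nn.BCEWithLogitsLoss (binary)
        # or nn.CrossEntropyLoss (multiclass) during training.
        return self.layers(x)


binary_model = TherapyNet(n_features=11, n_outputs=1)      # chemo yes/no
multiclass_model = TherapyNet(n_features=11, n_outputs=3)  # classes 0, 1, 2

# Forward pass on a random batch of 8 patients, with dropout disabled.
batch = torch.randn(8, 11)
binary_model.eval()
multiclass_model.eval()
with torch.no_grad():
    binary_logits = binary_model(batch)
    class_logits = multiclass_model(batch)
print(binary_logits.shape, class_logits.shape)
```

Using one class for both tasks keeps the two models comparable; only the output width and the loss function differ.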

### 🚫 Handling Overfitting

We observed initial signs of overfitting (training loss ↓, validation loss ↑). To mitigate this, we applied:

- **Dropout regularization**
- **Early stopping** with patience = 10
- **Feature reduction** to the most important predictors only
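
Early stopping with a patience of 10 can be sketched as a small framework-independent helper: training stops once the validation loss has failed to improve for 10 consecutive epochs. The class name `EarlyStopping`, the `min_delta` parameter, and the toy loss curve are assumptions for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation curve: improves for 5 epochs, then drifts upward.
val_losses = [1.0, 0.8, 0.7, 0.65, 0.6] + [0.6 + 0.01 * i for i in range(1, 31)]

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # 15 0.6
```

In a real training loop, the model weights from the best epoch would typically be saved and restored once the stop signal fires.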

## 📊 Results

### ✅ Binary Model – Chemotherapy Prediction

| Metric        | Score     |
|---------------|-----------|
| Accuracy      | 90.4%     |
| Precision     | 80.0%     |
| Recall        | 75.9%     |
| F1 Score      | 77.9%     |
| ROC AUC       | 95.5%     |

➡️ **High-performing model**: an ROC AUC of 95.5% indicates strong separation of patients who receive chemotherapy, although a recall of 75.9% means roughly one in four chemotherapy patients is still missed.
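
As a quick consistency check on the table, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper name below is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported binary-model values: precision 80.0%, recall 75.9%.
f1 = f1_from_precision_recall(0.800, 0.759)
print(f"F1 = {f1:.3f}")  # F1 = 0.779
```

The result matches the reported F1 score of 77.9%.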

---

### 🔍 Multiclass Model – Therapy Type (Simplified)

| Class   | Therapy Type   | Precision | Recall | F1 Score |
|---------|----------------|-----------|--------|----------|
| 0       | Chemotherapy   | 1.00      | 0.80   | 0.89     |
| 1       | Radiotherapy   | 0.82      | 0.71   | 0.76     |
| 2       | Hormone        | 0.76      | 0.86   | 0.80     |

- **Overall Accuracy**: 79.1%
- **Macro F1 Score**: 81.8%
- **Weighted F1 Score**: 78.9%

➡️ Good class separation overall: chemotherapy-only is predicted with perfect precision (1.00), while radiotherapy-only has the lowest recall (0.71) and F1 score (0.76).
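
The macro F1 score is the unweighted mean of the per-class F1 scores, so it can be approximately recomputed from the table. The weighted F1 score would additionally need the number of test patients per class, which is not listed here, so it is not reproduced.

```python
# Per-class F1 scores as reported in the table (rounded to two decimals).
per_class_f1 = {"chemotherapy": 0.89, "radiotherapy": 0.76, "hormone": 0.80}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.817
```

The recomputed 81.7% is within rounding distance of the reported 81.8%, which is what one would expect if the reported value was averaged from unrounded per-class scores.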

## 🧠 Interpretation

- The binary model shows that **treatment decisions for chemotherapy are highly predictable** from the selected features (ROC AUC 95.5%).
- The multiclass model suggests that **treatment types can be predicted with nearly 80% accuracy** (79.1%), under the assumption that each patient receives exactly one therapy.
- Strong predictors include **tumor size**, **nodal involvement**, **hormonal receptor status**, and the **Nottingham prognostic index**, all consistent with clinical reasoning.

---

## 📌 Future Work

- Include patients with combined therapies and model them hierarchically
- Explore explainable AI (e.g., SHAP values) for clinical transparency
- Deploy the model in a lightweight clinical dashboard for testing

---

## 📚 References

- Pereira et al., Nature Communications, 2016
- METABRIC study: [cBioPortal](https://www.cbioportal.org/study/summary?id=brca_metabric)
- Kaggle Dataset: [Breast Cancer - Gene Expression (METABRIC)](https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric)