# Machine-Learning-for-Disease-Treatment-Response-Prediction
# Background
Breast cancer is the most common cancer among women in the UK. Chemotherapy
is commonly used to reduce the size of locally advanced tumours before
surgery. However, chemotherapy is toxic to the human body and is not always
effective for every patient. Complete tumour resolution at surgery, known as
pathological complete response (PCR), is associated with a high likelihood of
cure and longer relapse-free survival (RFS). RFS is the length of time after
primary treatment for a cancer ends that the patient survives without any
signs or symptoms of that cancer. However, only 25% of patients receiving
chemotherapy achieve a PCR; the remaining 75% have residual disease and a
range of prognoses. Better patient stratification and treatment could be
achieved if PCR and RFS could be predicted using information available prior
to chemotherapy treatment.

# Aim
To use advanced machine learning methods to predict PCR (classification) and
RFS (regression) using both clinically measured features and features derived
from magnetic resonance images (MRI) prior to chemotherapy treatment.
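
As a rough illustration of the two tasks named above, the sketch below fits one classifier (for PCR) and one regressor (for RFS) on the same combined feature matrix. Everything in it is an assumption made for illustration: the data are synthetic, the variable names are hypothetical, and logistic regression and ridge regression from scikit-learn are simple baselines standing in for whatever "advanced" method is eventually chosen.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): 60 made-up patients with
# 10 "clinical" columns and 107 "MRI" columns, mirroring the feature counts.
rng = np.random.default_rng(0)
n_patients = 60
clinical = rng.normal(size=(n_patients, 10))
mri = rng.normal(size=(n_patients, 107))
X = np.hstack([clinical, mri])  # one feature matrix shared by both tasks

# Hypothetical targets: a binary PCR label and a continuous RFS value.
y_pcr = (clinical[:, 0] + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)
y_rfs = 50.0 + 10.0 * clinical[:, 1] + rng.normal(scale=2.0, size=n_patients)

# Task 1: classification (PCR). Task 2: regression (RFS).
# Features are standardised first because the columns have different scales.
pcr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rfs_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pcr_model.fit(X, y_pcr)
rfs_model.fit(X, y_rfs)

pcr_pred = pcr_model.predict(X)  # array of 0/1 class labels
rfs_pred = rfs_model.predict(X)  # array of real-valued predictions
```

Predictions here are made on the training rows only to show the output types; a real evaluation would need held-out data or cross-validation.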

# Data
A simplified dataset has been generated for this assignment, based on the
public dataset from The American College of Radiology Imaging Network
(I-SPY 2 TRIAL). Each patient record contains 10 clinical features (Age, ER,
PgR, HER2, TrippleNegative Status, Chemotherapy Grade, Tumour Proliferation,
Histology Type, Lymph node Status and Tumour Stage) and 107 MRI-based
features. The image-based features were extracted from the tumour region of
the MRIs using a radiomics feature extraction package, Pyradiomics
(https://pyradiomics.readthedocs.io/en/latest/). You do not need to
understand the meaning of the clinical and image-based features to complete
this assignment, but it is worth reading the background information on the
I-SPY 2 Trial website. A value of “999” in the spreadsheet denotes missing
data. A training dataset (trainDataset.xls) containing 400 patients is
provided on Moodle. A test dataset containing N patients is reserved (hidden
from you) for the final performance evaluation. You can assume that the test
set and training set are sampled from the same data distribution, but the
ratio of PCR-positive to PCR-negative patients could be different.
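
One possible way to handle the "999" missing-value marker is sketched below with pandas: the marker is converted to NaN so that it is not treated as a real measurement, and the fraction of PCR-positive patients is then computed from the non-missing labels. The helper names (`mark_missing`, `positive_fraction`) and the column names in the demo frame are hypothetical and not taken from trainDataset.xls. The sketch also assumes that no feature legitimately takes the value 999.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = 999  # from the description: "999" marks a missing value


def mark_missing(df: pd.DataFrame, sentinel: int = MISSING_SENTINEL) -> pd.DataFrame:
    """Return a copy of df with every cell equal to `sentinel` set to NaN."""
    return df.replace(sentinel, np.nan)


def positive_fraction(labels: pd.Series) -> float:
    """Fraction of positive (1) labels among the non-missing entries."""
    return float(labels.dropna().mean())


# Tiny made-up frame standing in for the spreadsheet; the column names
# are hypothetical placeholders.
demo = pd.DataFrame(
    {
        "pcr": [1, 0, 999, 0],
        "age": [41.0, 999.0, 55.5, 62.0],
        "mri_feature": [0.12, 0.40, 0.33, 999.0],
    }
)
clean = mark_missing(demo)
missing_per_column = clean.isna().sum()  # count of NaNs in each column
pcr_rate = positive_fraction(clean["pcr"])

# With the real file one would presumably start from something like:
#   raw = pd.read_excel("trainDataset.xls")  # reading .xls needs the xlrd package
```

Because the PCR ratio in the hidden test set may differ from the training set, it may be worth recording the training-set positive fraction and choosing evaluation metrics that are less sensitive to class balance.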