Context:
This data set contains published iTRAQ proteome profiling of 77 breast cancer samples generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH). It contains expression values for ~12,000 proteins for each sample, with missing values present when a given protein could not be quantified in a given sample.
Content:
File: 77_cancer_proteomes_CPTAC_itraq.csv
File: clinical_data_breast_cancer.csv
The first column, "Complete TCGA ID", is used to match the sample IDs in the main cancer proteomes file (see example script).
All other columns have self-explanatory names and contain data about the cancer classification of a given sample using different methods. The 'PAM50 mRNA' classification is used in the example script.
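To illustrate the matching step described above, here is a minimal sketch using pandas on small made-up tables. Only the column names "Complete TCGA ID" and "PAM50 mRNA" come from the clinical file; the sample IDs, protein IDs, subtype labels and expression values are invented, and the sketch assumes the sample column names in the proteome table are already written in the same form as the clinical IDs (in the real files they may need reformatting first).

```python
import pandas as pd

# Made-up stand-in for clinical_data_breast_cancer.csv (only two columns shown).
clinical = pd.DataFrame({
    "Complete TCGA ID": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
    "PAM50 mRNA": ["Basal-like", "Luminal A", "HER2-enriched"],  # illustrative labels
})

# Made-up stand-in for the proteome table: one row per protein, one column per sample.
# ASSUMPTION: sample columns already use the same ID format as the clinical file.
proteome = pd.DataFrame(
    {
        "TCGA-AA-0001": [0.5, -1.2],
        "TCGA-AA-0002": [float("nan"), 0.3],  # NaN = protein not quantified
        "TCGA-AA-9999": [1.1, 0.0],           # sample with no clinical record
    },
    index=["NP_000001", "NP_000002"],
)

# Transpose so that samples become rows, then attach the clinical labels by ID.
samples = proteome.T
samples.index.name = "Complete TCGA ID"
merged = samples.reset_index().merge(
    clinical, on="Complete TCGA ID", how="left", indicator=True
)

# Samples that found no partner in the clinical table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Complete TCGA ID"].tolist()
print(merged[["Complete TCGA ID", "PAM50 mRNA"]])
print("Unmatched samples:", unmatched)
```

A left join with `indicator=True` keeps every proteome sample and makes it easy to see which IDs failed to match, which is usually the first thing to check before any analysis.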
File: PAM50_proteins.csv
Contains the list of genes and proteins used by the PAM50 classification system. The column RefSeqProteinID contains the protein IDs that can be matched with the IDs in the main protein expression data set.
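As a rough sketch of how the PAM50 list can be used to subset the protein expression data, the example below filters a made-up proteome table by the RefSeqProteinID column. The column name `protein_id`, the `GeneSymbol` column and all IDs and values are hypothetical placeholders; replace them with whatever the real files contain.

```python
import pandas as pd

# Made-up stand-in for PAM50_proteins.csv. Only RefSeqProteinID is a real column name.
pam50 = pd.DataFrame({
    "GeneSymbol": ["GENE_A", "GENE_B", "GENE_C"],  # hypothetical column and values
    "RefSeqProteinID": ["NP_000001", "NP_000003", "NP_000004"],
})

# Made-up stand-in for the main proteome table.
# ASSUMPTION: the protein ID column is called "protein_id" here for illustration only.
proteome = pd.DataFrame({
    "protein_id": ["NP_000001", "NP_000002", "NP_000003"],
    "sample_1": [0.5, -1.2, 0.8],
    "sample_2": [float("nan"), 0.3, -0.4],
})

# Keep only the rows whose protein ID appears in the PAM50 list.
pam50_ids = set(pam50["RefSeqProteinID"])
pam50_proteome = proteome[proteome["protein_id"].isin(pam50_ids)]

# PAM50 proteins that were not found in the proteome table at all.
missing = sorted(pam50_ids - set(proteome["protein_id"]))
print(pam50_proteome)
print("PAM50 IDs not found:", missing)
```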
Past Research:
The original study: http://www.nature.com/nature/journal/v534/n7605/full/nature18003.html (paywall warning)
In brief: the data were used to assess how mutations in the DNA affect the protein expression landscape in breast cancer. Genes in our DNA are first transcribed into RNA molecules, which are then translated into proteins. Changing the information content of DNA has an impact on the behavior of the proteome, which is the main functional unit of cells, taking care of cell division, DNA repair, enzymatic reactions, signaling, etc. The authors performed K-means clustering on the protein data to divide the breast cancer patients into sub-types, each having a unique protein expression signature. They found that the best clustering was achieved using 3 clusters (the original PAM50 gene set yields four different subtypes using RNA data).
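For readers who have not used K-means before, here is a minimal sketch of clustering samples into 3 groups with scikit-learn. It runs on synthetic data, not on the real proteome, and the median imputation of missing values is my own assumption for the sketch (K-means in scikit-learn cannot handle NaN); it is not necessarily the preprocessing used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data: 30 "samples" x 20 "proteins", drawn from 3 well-separated groups.
group_means = [-3.0, 0.0, 3.0]
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(10, 20)) for m in group_means])

# Knock out about 5% of the values to mimic proteins that could not be quantified.
X[rng.random(X.shape) < 0.05] = np.nan

# ASSUMPTION: fill each missing value with the median of that protein across samples.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Cluster the samples into 3 groups, as in the best clustering reported above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_imputed)

print(labels)
```

With the real data, samples should be the rows and proteins the columns before calling `fit_predict`, so the table may need to be transposed first.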
Inspiration:
This is an interesting study, and I wanted to use this breast cancer proteome data set for other types of machine learning analyses that I am performing as part of my PhD. However, I thought that the Kaggle community (or at least the part of it with biomedical interests) would enjoy playing with it. I added a simple K-means clustering example for the data with some comments, using the same approach as the original paper.
One thing to note: there is a panel of genes, the PAM50, which is used to classify breast cancers into subtypes. This panel was originally based on RNA expression data, which is (in my opinion) not as robust as measuring mRNA's final product, the protein. Perhaps using this data set, someone could find a different set of proteins (they all have unique NP_/XP_ identifiers) that would divide the data set even more robustly? Perhaps into a higher number of clusters with very distinct protein expression signatures?
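If you want to explore different numbers of clusters, one common heuristic (my suggestion, not something taken from the paper) is to compare the silhouette score across several values of k. The sketch below does this on synthetic data with 4 built-in groups; with the real data, missing values would have to be imputed or removed first, and a different protein subset would simply mean passing different columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: 60 "samples" x 20 "proteins", drawn from 4 well-separated groups.
group_means = [-6.0, -2.0, 2.0, 6.0]
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(15, 20)) for m in group_means])

# Fit K-means for several cluster counts and record the mean silhouette score.
# Higher is better: samples sit close to their own cluster and far from the others.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

The silhouette score is only one of several possible criteria, so treat the "best" k as a starting point for looking at the clusters rather than a final answer.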
Example K-means analysis script:
http://pastebin.com/A0Wj41DP