
About Dataset

Context:
This data set contains published iTRAQ proteome profiling of 77 breast cancer samples generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH). It contains expression values for ~12,000 proteins for each sample, with missing values present when a given protein could not be quantified in a given sample.

Content:

File: 77_cancer_proteomes_CPTAC_itraq.csv

  • RefSeq_accession_number: RefSeq protein ID (each protein has a unique
    ID in a RefSeq database)
  • gene_symbol: a symbol unique to each gene (every protein is encoded
    by some gene)
  • gene_name: the full name of that gene
  • Remaining columns: log2 iTRAQ ratios for each sample (the protein
    expression data, the most important part); the last three columns
    are from healthy individuals
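
As a rough sketch of how the layout above could be handled with pandas: the function below separates the three annotation columns from the sample columns and then splits off the last three (healthy) columns. The helper names `split_proteome` and `missing_fraction` are made up for illustration, and the code assumes the column names and column order are exactly as listed above.

```python
import pandas as pd

# Annotation column names as listed in the file description above.
ANNOTATION_COLUMNS = ["RefSeq_accession_number", "gene_symbol", "gene_name"]


def split_proteome(proteome: pd.DataFrame, n_healthy: int = 3):
    """Split the table into annotations, tumor samples and healthy samples.

    ASSUMPTION: the healthy samples are the last `n_healthy` columns.
    """
    annotations = proteome[ANNOTATION_COLUMNS]
    # Index the expression values by protein ID so rows stay identifiable.
    expression = proteome.set_index("RefSeq_accession_number").drop(
        columns=["gene_symbol", "gene_name"]
    )
    tumor = expression.iloc[:, :-n_healthy]
    healthy = expression.iloc[:, -n_healthy:]
    return annotations, tumor, healthy


def missing_fraction(expression: pd.DataFrame) -> pd.Series:
    """Fraction of samples in which each protein could not be quantified."""
    return expression.isna().mean(axis=1)


# Typical use (not run here):
# proteome = pd.read_csv("77_cancer_proteomes_CPTAC_itraq.csv")
# annotations, tumor, healthy = split_proteome(proteome)
```

Looking at `missing_fraction(tumor)` first is probably a good idea, since proteins that are missing in most samples may need to be dropped before any clustering.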

File: clinical_data_breast_cancer.csv

First column "Complete TCGA ID" is used to match the sample IDs in the main cancer proteomes file (see example script).
All other columns have self-explanatory names and contain data about the cancer classification of a given sample using different methods. The 'PAM50 mRNA' classification is used in the example script.
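
The ID matching could look roughly like the sketch below. Note the big assumption: the exact ID formats are not spelled out here, so the code assumes that a proteome sample column looks like "AA-0001.01TCGA" and the corresponding clinical ID looks like "TCGA-AA-0001" (both IDs are invented placeholders). The helpers `to_tcga_id` and `attach_subtype` are hypothetical names; check the real files and adjust the conversion if the formats differ.

```python
import pandas as pd


def to_tcga_id(sample_column: str) -> str:
    """Convert a proteome column name to a clinical 'Complete TCGA ID'.

    ASSUMPTION: columns look like "AA-0001.01TCGA" and clinical IDs look
    like "TCGA-AA-0001". Adjust this function if the real files differ.
    """
    return "TCGA-" + sample_column.split(".")[0]


def attach_subtype(sample_columns, clinical: pd.DataFrame) -> pd.Series:
    """Return the 'PAM50 mRNA' label per sample column (NaN if unmatched).

    Requires the 'Complete TCGA ID' values in `clinical` to be unique.
    """
    subtype_by_id = clinical.set_index("Complete TCGA ID")["PAM50 mRNA"]
    ids = [to_tcga_id(column) for column in sample_columns]
    # reindex keeps the sample order and yields NaN for IDs with no match.
    return pd.Series(
        subtype_by_id.reindex(ids).to_numpy(), index=list(sample_columns)
    )
```

Samples that come back as NaN have no clinical record under the assumed ID format, which is a quick way to spot a wrong conversion rule.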

File: PAM50_proteins.csv

Contains the list of genes and proteins used by the PAM50 classification system. The column RefSeqProteinID contains the protein IDs that can be matched with the IDs in the main protein expression data set.
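
A minimal sketch of that matching step, assuming both files have been loaded into pandas DataFrames with the column names given above (`select_pam50` is a hypothetical helper name):

```python
import pandas as pd


def select_pam50(proteome: pd.DataFrame, pam50: pd.DataFrame) -> pd.DataFrame:
    """Keep only proteome rows whose RefSeq ID appears in the PAM50 list."""
    pam50_ids = set(pam50["RefSeqProteinID"].dropna())
    mask = proteome["RefSeq_accession_number"].isin(pam50_ids)
    return proteome.loc[mask].reset_index(drop=True)


# Typical use (not run here):
# proteome = pd.read_csv("77_cancer_proteomes_CPTAC_itraq.csv")
# pam50 = pd.read_csv("PAM50_proteins.csv")
# pam50_proteome = select_pam50(proteome, pam50)
```

Not every PAM50 protein is guaranteed to be present in the expression data, so the result may have fewer rows than the PAM50 list.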

Past Research:
The original study: http://www.nature.com/nature/journal/v534/n7605/full/nature18003.html (paywall warning)

In brief: the data were used to assess how mutations in the DNA affect the protein expression landscape in breast cancer. Genes in our DNA are first transcribed into RNA molecules, which are then translated into proteins. Changing the information content of DNA has an impact on the behavior of the proteome, which is the main functional unit of cells, taking care of cell division, DNA repair, enzymatic reactions, signaling, etc. The authors performed K-means clustering on the protein data to divide the breast cancer patients into subtypes, each with a unique protein expression signature. They found that the best clustering was achieved using 3 clusters (the original PAM50 gene set yields four different subtypes using RNA data).
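
To give a feel for this kind of analysis, here is a minimal K-means sketch using scikit-learn. It is not the pipeline from the paper: the median imputation of missing values, the fixed random seed and the helper name `cluster_samples` are my own assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_samples(
    expression: pd.DataFrame, n_clusters: int = 3, seed: int = 0
) -> pd.Series:
    """Cluster samples (columns) by their protein expression profiles.

    `expression` has proteins as rows and samples as columns.
    """
    # Drop proteins that were never quantified, then fill the remaining
    # gaps with each protein's median (ASSUMPTION: a simple imputation).
    filled = expression.dropna(axis=0, how="all")
    filled = filled.apply(lambda row: row.fillna(row.median()), axis=1)
    # scikit-learn expects one row per observation: samples x proteins.
    features = filled.T.to_numpy()
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return pd.Series(labels, index=expression.columns, name="cluster")
```

The cluster numbers themselves are arbitrary; only which samples end up together is meaningful.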

Inspiration:

This is an interesting study and I myself wanted to use this breast cancer proteome data set for other types of machine learning analyses that I am performing as a part of my PhD. However, I thought that the Kaggle community (or at least the part with biomedical interests) would enjoy playing with it. I added a simple K-means clustering example for the data with some comments, using the same approach as the original paper.
One thing to note is that there is a panel of genes, the PAM50, which is used to classify breast cancers into subtypes. This panel was originally based on RNA expression data, which is (in my opinion) not as robust as the measurement of mRNA's final product, the protein. Perhaps using this data set, someone could find a different set of proteins (they all have unique NP_/XP_ identifiers) that would divide the data set even more robustly? Perhaps into a higher number of clusters with very distinct protein expression signatures?
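
One possible way to compare different numbers of clusters is sketched below. It scores each k with the mean silhouette coefficient from scikit-learn; this metric is just my suggestion for a starting point, not something taken from the original paper. `silhouette_by_k` is a hypothetical helper, and `features` is assumed to be a samples x proteins array with no missing values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(
    features: np.ndarray, k_values=range(2, 7), seed: int = 0
) -> dict:
    """Return {k: mean silhouette score} for K-means run with each k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        # Scores closer to 1 mean tighter, better separated clusters.
        scores[k] = float(silhouette_score(features, labels))
    return scores
```

Running this on different candidate protein sets and comparing the best scores would give a rough idea of which set separates the samples more cleanly.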

Example K-means analysis script:
http://pastebin.com/A0Wj41DP