About Dataset

The Genomics of Drug Sensitivity in Cancer (GDSC) dataset is a valuable resource for therapeutic biomarker discovery in cancer research. This dataset combines drug response data with genomic profiles of cancer cell lines, allowing researchers to investigate the relationship between genetic features and drug sensitivity.

Task:

The primary task associated with this dataset is to predict drug sensitivity (measured as IC50 values) based on genomic features of cancer cell lines. This can involve regression tasks to predict exact IC50 values or classification tasks to categorize cell lines as sensitive or resistant to specific drugs. The dataset also allows for the identification of genomic markers that correlate with drug response.
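The classification variant of the task needs a rule for turning continuous LN_IC50 values into sensitive/resistant labels. The sketch below is one illustrative way to do that for a single drug. The per-drug median cutoff and the function name `label_sensitivity` are assumptions made for this example, not something defined by GDSC or by this dataset.

```python
from statistics import median


def label_sensitivity(ln_ic50_by_cell_line):
    """Label each cell line 'sensitive' or 'resistant' for one drug.

    Assumption: the cutoff is the median LN_IC50 across the cell lines
    tested with that drug. This is an illustrative choice, not a GDSC rule.
    """
    cutoff = median(ln_ic50_by_cell_line.values())
    return {
        name: "sensitive" if value < cutoff else "resistant"
        for name, value in ln_ic50_by_cell_line.items()
    }


# Made-up LN_IC50 values for a single hypothetical drug.
example = {"LINE_A": -2.1, "LINE_B": 0.4, "LINE_C": 3.0, "LINE_D": 1.2}
labels = label_sensitivity(example)
```

Lower LN_IC50 means higher sensitivity, so values below the cutoff are labelled sensitive. Other cutoffs (for example a fixed concentration or a Z_SCORE threshold) may suit a given analysis better.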

Files:

GDSC2-dataset.csv: Contains drug sensitivity data, including IC50 values, for various drugs tested against cancer cell lines. (Original source file)
Cell_Lines_Details.xlsx: Provides detailed information about the cancer cell lines, including genomic features such as mutations, copy number alterations, and gene expression. (Original source file)
Compounds-annotation.csv: Offers information about the drugs used in the screening, including their targets and pathways. (Original source file)
GDSC_DATASET.csv: This is the main dataset file for analysis. It's a merged file combining key information from the above three files, created to facilitate easier analysis. This consolidated dataset includes all necessary features for drug sensitivity prediction and is recommended for use in your analysis.

Detailed Column Descriptions:

1. GDSC2-dataset.csv:

DATASET: Identifier for the specific GDSC dataset version.
NLME_RESULT_ID: Unique identifier for the non-linear mixed effects model result.
NLME_CURVE_ID: Identifier for the dose-response curve fitted by NLME.
COSMIC_ID: Unique identifier for the cell line from the COSMIC database.
CELL_LINE_NAME: Name of the cancer cell line used in the experiment.
SANGER_MODEL_ID: Identifier used by the Sanger Institute for the cell line model.
TCGA_DESC: Description of the cancer type according to The Cancer Genome Atlas.
DRUG_ID: Unique identifier for the drug used in the experiment.
DRUG_NAME: Name of the drug used in the experiment.
PUTATIVE_TARGET: The presumed molecular target of the drug.
PATHWAY_NAME: The biological pathway affected by the drug.
COMPANY_ID: Identifier for the company that provided the drug.
WEBRELEASE: Date or version of web release for this data.
MIN_CONC: Minimum concentration of the drug used in the experiment.
MAX_CONC: Maximum concentration of the drug used in the experiment.
LN_IC50: Natural log of the half-maximal inhibitory concentration (IC50).
AUC: Area under the fitted dose-response curve, a summary measure of drug response across the tested concentration range.
RMSE: Root Mean Square Error, indicating the fit quality of the dose-response curve.
Z_SCORE: Standardized score of the drug response, allowing comparison across different drugs and cell lines.
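To make the idea of a standardized response score concrete, here is a minimal sketch of a z-score computed over LN_IC50 values. It assumes the scores are standardized across the cell lines tested with one drug and uses the population standard deviation. Both are assumptions for illustration, since the exact procedure behind the Z_SCORE column is not described here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no spread to standardize against.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up LN_IC50 values for one hypothetical drug across three cell lines.
ln_ic50 = [1.0, 2.0, 3.0]
scores = z_scores(ln_ic50)
```

Under this convention a negative score marks a cell line that is more sensitive than average for that drug.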

2. Cell_Lines_Details.xlsx:

Sample Name: Unique identifier for the cell line sample.
COSMIC identifier: Unique ID from the COSMIC database for the cell line.
Whole Exome Sequencing (WES): Genetic mutation data from whole exome sequencing.
Copy Number Alterations (CNA): Data on gene copy number changes in the cell line.
Gene Expression: Information on gene expression levels in the cell line.
Methylation: Data on DNA methylation patterns in the cell line.
Drug Response: Information on how the cell line responds to various drugs.
GDSC Tissue descriptor 1: Primary tissue type classification.
GDSC Tissue descriptor 2: Secondary tissue type classification.
Cancer Type (matching TCGA label): Cancer type according to TCGA classification.
Microsatellite instability Status (MSI): Indicates the cell line's MSI status.
Screen Medium: The growth medium used for culturing the cell line.
Growth Properties: Characteristics of how the cell line grows in culture.

3. Compounds-annotation.csv:

DRUG_ID: Unique identifier for the drug.
SCREENING_SITE: Location where the drug screening was performed.
DRUG_NAME: Name of the drug compound.
SYNONYMS: Alternative names for the drug.
TARGET: The molecular target(s) of the drug.
TARGET_PATHWAY: The biological pathway(s) targeted by the drug.
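The column lists above suggest how the three source files relate: response rows can be linked to cell line details through the COSMIC identifier and to compound annotations through DRUG_ID. The sketch below shows that join on tiny made-up tables using only the standard library. The join keys are inferred from the column descriptions, the sample values and helper names (`read_rows`, `merge_gdsc`) are hypothetical, and the way GDSC_DATASET.csv was actually built is not documented here.

```python
import csv
import io

# Tiny made-up stand-ins for the three source files (values are illustrative).
responses_csv = """COSMIC_ID,CELL_LINE_NAME,DRUG_ID,LN_IC50
1,LINE_A,10,-1.5
2,LINE_B,10,2.0
1,LINE_A,20,0.3
"""
cell_lines_csv = """COSMIC identifier,GDSC Tissue descriptor 1
1,lung
2,breast
"""
compounds_csv = """DRUG_ID,DRUG_NAME,TARGET_PATHWAY
10,DrugX,PathwayX
20,DrugY,PathwayY
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_gdsc(responses, cell_lines, compounds):
    """Left-join response rows to cell line and compound annotations."""
    lines_by_id = {row["COSMIC identifier"]: row for row in cell_lines}
    drugs_by_id = {row["DRUG_ID"]: row for row in compounds}
    merged = []
    for row in responses:
        out = dict(row)
        # A missing annotation adds no columns (left-join behaviour).
        line = lines_by_id.get(row["COSMIC_ID"], {})
        out.update({k: v for k, v in line.items() if k != "COSMIC identifier"})
        drug = drugs_by_id.get(row["DRUG_ID"], {})
        out.update({k: v for k, v in drug.items() if k != "DRUG_ID"})
        merged.append(out)
    return merged


merged = merge_gdsc(
    read_rows(responses_csv),
    read_rows(cell_lines_csv),
    read_rows(compounds_csv),
)
```

With the real files, Cell_Lines_Details.xlsx would first need to be read with a spreadsheet-capable library, and a dataframe library's merge function would do the same joins more concisely.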

Target Variable:

The primary target variable in this dataset is LN_IC50, the natural log of the half-maximal inhibitory concentration (IC50). IC50 is the concentration of a drug that inhibits cell viability by 50%; LN_IC50 expresses that concentration on a natural-log scale. Lower LN_IC50 values indicate higher drug sensitivity, making it a crucial metric for evaluating the effectiveness of anti-cancer drugs against various cancer cell lines.
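Because the target is stored as a natural log, converting between LN_IC50 and IC50 is a single exponential or logarithm. The sketch below shows both directions. The function names are hypothetical, and the result is in whatever concentration unit the dataset uses, which is not stated here.

```python
import math


def ln_ic50_to_ic50(ln_ic50):
    """Convert a natural-log IC50 back to a concentration."""
    return math.exp(ln_ic50)


def ic50_to_ln_ic50(ic50):
    """Convert a concentration to its natural log; IC50 must be positive."""
    if ic50 <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return math.log(ic50)


# An LN_IC50 of 0 corresponds to an IC50 of exactly 1 concentration unit.
ic50_at_zero = ln_ic50_to_ic50(0.0)
```

Negative LN_IC50 values therefore correspond to IC50 values below 1 unit, and positive values to IC50 values above 1 unit.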

Data Collection:

The data was collected by the Genomics of Drug Sensitivity in Cancer project, a collaboration between the Sanger Institute (UK) and the Massachusetts General Hospital Cancer Center (USA). The project involves large-scale screening of human cancer cell lines with a wide range of anti-cancer drugs. Cell viability was measured using the CellTiter-Glo assay after 72 hours of drug treatment.
The datasets can be accessed and downloaded from the GDSC website: GDSC Database.

Key Points:

The dataset covers over 1,000 human cancer cell lines and hundreds of anti-cancer drugs.
Genomic features include gene mutations, copy number variations, and gene expression levels.
Drug response is primarily measured using IC50 values (the concentration of a drug that reduces cell viability by 50%).
The data can be used to identify biomarkers of drug sensitivity and to develop personalized cancer treatment strategies.
Regular updates are made to the dataset, expanding its coverage and improving data quality.

This dataset is crucial for researchers in cancer pharmacogenomics, allowing them to explore how genetic variations in cancer cells affect their response to different drugs. It has potential applications in drug discovery, personalized medicine, and understanding mechanisms of drug resistance in cancer.