This dataset is a curated compilation of prostate cancer genomic data, created by merging four GEO Series (GSE) datasets from the NCBI GEO database. Only datasets generated on one of three GPL platforms (GPL570, GPL96, and GPL571) were selected, to keep the merged data compatible and consistent. The final dataset is a resource for analyzing gene expression and genomic variations across different prostate cancer conditions.
Initially, several GSE datasets generated on different GPL platforms were downloaded from the NCBI database. An in-depth analysis showed that the GPL570, GPL96, and GPL571 platforms share significant overlap, so only datasets from these three platforms were selected for further analysis. The raw data for each GSE was downloaded from the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi). Any dataset without raw data was excluded from the selection.
The four selected GSE datasets are:
• GSE32448
• GSE46602
• GSE7307
• GSE69223
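The platform overlap mentioned above can be illustrated with a small sketch. It assumes that "overlap" means probe identifiers shared between platforms, which the description does not state explicitly. The probe names below are placeholders, not real identifiers; in practice they would come from each platform's annotation table.

```python
# Hypothetical probe identifiers per platform (placeholders, not real probe IDs).
platform_probes = {
    "GPL570": {"p1", "p2", "p3", "p4"},
    "GPL96": {"p1", "p2", "p3"},
    "GPL571": {"p1", "p2", "p5"},
}


def shared_probes(probes_by_platform):
    """Return the probes present on every platform."""
    return set.intersection(*probes_by_platform.values())


def overlap_fraction(probes_by_platform):
    """Return, per platform, the fraction of its probes shared by all platforms."""
    shared = shared_probes(probes_by_platform)
    return {
        name: len(shared) / len(probes)
        for name, probes in probes_by_platform.items()
    }


shared = shared_probes(platform_probes)
fractions = overlap_fraction(platform_probes)
print(sorted(shared))
print(fractions)
```

Restricting a merged matrix to the shared set is one simple way to make samples from different platforms comparable; whether that was the approach used here is not stated.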
The selected datasets were categorized into three groups based on the "Gillison" classification:
• Normal
• Benign
• Tumor
Subsequently, the quality of the data was assessed using the MetaQC package in R. The results confirmed that these four datasets were of the highest quality and suitable for further analysis.
The header rows of the dataset are structured as follows:
• First Row (GSE Names): The first row contains the names of the GEO Series (GSE) datasets from which the data were sourced. Each column is labeled with the GSE that its sample came from.
• Second Row (GSM Identifiers): The second row contains the GEO Sample (GSM) identifiers, which uniquely identify each sample within its GSE dataset. These identifiers make it possible to trace each sample back to its original record in the NCBI GEO database.
• Gillison Classification (Parameter after _): The number following the underscore (_) in each second-row entry represents the Gillison classification of that sample. This classification assigns each sample to one of three groups based on clinical status:
1: Normal
2: Benign
3: Tumor
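The layout described above can be read with a short parser. This is a minimal sketch, assuming the file is comma-separated and that the first column holds gene or probe identifiers; neither is stated in the description. The GSM identifiers and expression values in the toy input are made up for illustration, and `load_expression` is a hypothetical helper name.

```python
import csv
import io

# Gillison classification codes as listed in the description.
GROUPS = {"1": "Normal", "2": "Benign", "3": "Tumor"}

# Toy stand-in for the real file: GSM IDs and values are invented.
TOY_FILE = """\
,GSE32448,GSE32448,GSE46602
,GSM0000001_1,GSM0000002_3,GSM0000003_2
geneA,5.1,7.3,6.0
geneB,8.2,8.0,8.4
"""


def load_expression(handle):
    """Split the two header rows into sample metadata and parse the values."""
    rows = [row for row in csv.reader(handle) if row]
    gse_row, gsm_row, data_rows = rows[0], rows[1], rows[2:]

    samples = []
    for gse, tagged in zip(gse_row[1:], gsm_row[1:]):
        # "GSM..._3" -> ("GSM...", "3"); split on the last underscore only.
        gsm, code = tagged.rsplit("_", 1)
        samples.append({"gsm": gsm, "gse": gse, "group": GROUPS[code]})

    # Map each row identifier to its list of values, in sample order.
    expression = {row[0]: [float(v) for v in row[1:]] for row in data_rows}
    return samples, expression


samples, expression = load_expression(io.StringIO(TOY_FILE))
for sample in samples:
    print(sample)
```

For the real file, replace the `io.StringIO` object with an open file handle, and adjust the delimiter if the file is not comma-separated.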
The purpose of this dataset is to provide a high-quality, integrated resource for researchers aiming to understand the molecular mechanisms of prostate cancer. It can be used for various applications, including identifying biomarkers, exploring gene expression patterns, and developing predictive models.
Researchers can utilize this dataset to:
• Investigate gene expression differences across Normal, Benign, and Tumor samples.
• Develop and validate predictive models for prostate cancer progression.
• Explore potential biomarkers for early detection and treatment.
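As a starting point for the first use case, the sketch below ranks genes by the difference in mean expression between Tumor and Normal samples. It assumes the values are on a comparable (for example log-transformed) scale, which should be verified against the actual data. The group labels and numbers are toy values, and a simple difference of means is only a first look, not a substitute for a proper statistical test.

```python
from collections import defaultdict
from statistics import mean

# Toy inputs: one group label per sample column, and invented values per gene.
groups = ["Normal", "Normal", "Benign", "Tumor", "Tumor"]
expression = {
    "geneA": [5.0, 5.2, 5.6, 7.1, 7.3],
    "geneB": [8.0, 8.2, 8.1, 8.0, 8.3],
}


def group_means(values, labels):
    """Average the values of one gene within each sample group."""
    buckets = defaultdict(list)
    for value, label in zip(values, labels):
        buckets[label].append(value)
    return {label: mean(vals) for label, vals in buckets.items()}


def tumor_vs_normal(expr, labels):
    """Rank genes by absolute Tumor-minus-Normal mean difference, largest first."""
    diffs = {}
    for gene, values in expr.items():
        means = group_means(values, labels)
        diffs[gene] = means["Tumor"] - means["Normal"]
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = tumor_vs_normal(expression, groups)
for gene, diff in ranked:
    print(f"{gene}: {diff:+.2f}")
```

Because the samples come from four separate GSE datasets, differences between source datasets may need to be accounted for before interpreting such a ranking.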
Special thanks to the researchers who originally published the GSE datasets and to the NCBI GEO database for making these data publicly available. This dataset was compiled with great care to provide a valuable resource for the scientific community.