About Dataset

Breast Cancer Big Data

The CSAW-CC dataset contains mammography images from breast cancer screenings at Karolinska University Hospital (Stockholm, Sweden), collected between 2008 and 2015. It covers both breast cancer patients and healthy controls and is intended to support the development of AI systems for early breast cancer detection, classification, and prognostics. Detailed lesion annotations by radiologists make it suitable for training AI models, such as convolutional neural networks (CNNs), to detect early-stage cancer and to differentiate between benign and malignant tumors.

Key Dataset Features

  • Columns: Image ID, age, screening date, lesion type, image, and pixel-level tumor annotations.
  • Data Scope: Over 1,100 cancer cases and 10,000 healthy controls.
  • Purpose: To improve AI-driven breast cancer detection and risk prediction.
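
The data scope above implies a strong class imbalance, roughly nine healthy controls per cancer case. A minimal sketch of inverse-frequency class weighting follows, as one common way to account for this during training. The counts are the lower bounds quoted above ("over 1,100" and "10,000"), not exact figures, and the label names and the function `inverse_frequency_weights` are hypothetical and not part of the dataset.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Approximate counts taken from the card; the real numbers may be higher.
counts = {"cancer": 1100, "healthy": 10000}
weights = inverse_frequency_weights(counts)

# Ratio of healthy controls to cancer cases under these assumed counts.
imbalance_ratio = counts["healthy"] / counts["cancer"]

print(weights)
print(round(imbalance_ratio, 2))
```

With these weights, the rarer cancer class contributes more per example to a weighted loss than the healthy class. Other strategies, such as resampling, may be equally reasonable.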

Columns

  • Image ID: Unique identifier for each mammographic image.
  • Age: Age of the patient at the time of the screening.
  • Screening Date: The date when the screening was performed.
  • Lesion Type: Classification of lesions into benign or malignant.
  • Image: Mammogram images that serve as the primary input for training AI models.
  • Annotations: Pixel-level annotations of tumors, drawn by expert breast radiologists, marking the precise location of detected lesions and micro-calcifications. For some images taken before diagnosis, the annotations indicate the predicted location of potential tumors.
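
The columns above can be sketched as a simple record type. This is an illustration only: the field names, the nested-list mask representation (1 for an annotated pixel, 0 for background), and the use of `None` for images without a lesion label are assumptions, not the dataset's actual file layout. The helper `lesion_bounding_box` is likewise hypothetical and shows one way a pixel-level annotation could be reduced to a bounding box.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Assumed mask format: rows of 0/1 values, 1 = annotated lesion pixel.
Mask = List[List[int]]


@dataclass
class ScreeningRecord:
    image_id: str
    age: int
    screening_date: date
    lesion_type: Optional[str]   # "benign", "malignant", or None (assumed)
    mask: Optional[Mask] = None  # pixel-level annotation, if available


def lesion_bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (min_row, min_col, max_row, max_col) of annotated pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # nothing annotated in this mask
    return (min(rows), min(cols), max(rows), max(cols))


# Made-up example record; none of these values come from the dataset.
example = ScreeningRecord(
    image_id="example_001",
    age=57,
    screening_date=date(2012, 5, 14),
    lesion_type="malignant",
    mask=[
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ],
)
box = lesion_bounding_box(example.mask)
empty_box = lesion_bounding_box([[0, 0], [0, 0]])
```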

Ethical Considerations

  • Ethical Approval: The dataset was reviewed and approved by the Ethical Review Board of Stockholm (ethical permission number EPN 2016/2600-31), which waived the need for individual informed consent.
  • Ethical Oversight: Additional ethical reviews were conducted and approved by the Swedish Ethical Review Authority under permission EPM 2019-01946.