In this project, we will use a standard imbalanced machine learning dataset referred to as the mammography dataset, or sometimes Woods Mammography. The dataset is credited to Kevin Woods, et al. and the 1993 paper titled "Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography." The problem focuses on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.
The dataset began with 24 scanned mammograms, each with a known cancer diagnosis. The images were then pre-processed using image segmentation computer vision algorithms to extract candidate objects from the mammogram images. Once segmented, the objects were manually labeled by an experienced radiologist.
A total of 29 features thought to be most relevant to pattern recognition were extracted from the segmented objects. These were reduced to 18, then finally to six, as follows (taken directly from the paper):
Area of object (in pixels).
Average gray level of the object.
Gradient strength of the object’s perimeter pixels.
Root mean square noise fluctuation in the object.
Contrast, average gray level of the object minus the average of a two-pixel wide border surrounding the object.
A low order moment based on shape descriptor.
There are two classes, and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.
Non-microcalcifications: negative case, or majority class.
Microcalcifications: positive case, or minority class.
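To make the problem framing concrete, the six input features and the two classes can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the dataset's actual schema: the short feature names, the 0/1 label encoding, and the toy label counts are all hypothetical and chosen only to show how a class distribution might be summarized.

```python
from collections import Counter

# Hypothetical short names for the six features listed above
# (the names are an assumption; the paper only describes the features).
FEATURE_NAMES = [
    "area",               # area of object (in pixels)
    "avg_gray_level",     # average gray level of the object
    "gradient_strength",  # gradient strength of the perimeter pixels
    "rms_noise",          # root mean square noise fluctuation
    "contrast",           # object gray level minus surrounding border
    "shape_moment",       # low order moment based on shape descriptor
]

# Assumed label encoding for illustration only:
# 0 = non-microcalcification (negative, majority class)
# 1 = microcalcification (positive, minority class)
NEGATIVE, POSITIVE = 0, 1


def class_summary(labels):
    """Return {label: (count, percentage)} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, 100.0 * count / total) for label, count in counts.items()}


# Toy labels, NOT the real dataset: 95 negative and 5 positive examples.
toy_labels = [NEGATIVE] * 95 + [POSITIVE] * 5

for label, (count, pct) in sorted(class_summary(toy_labels).items()):
    print(f"class={label} count={count} percent={pct:.1f}%")
```

Running the same summary on the real class labels would show how skewed the dataset is toward the negative class.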