IMAGE CREDIT: Anh Vo
In the first part of this series, Introduction to convolutional neural networks in Caffe, I covered the steps to recreate the basics of the convolutional neural network proposed in the paper Acute Myeloid Leukemia Classification Using Convolution Neural Network In Clinical Decision Support System.
In this article I will cover the steps needed to create the dataset used to train the model with the network we defined in the previous tutorial. The article follows the paper exactly and uses the original 108 image dataset (ALL_IDB1).
A reminder that we use the ALL_IDB1 dataset from the Acute Lymphoblastic Leukemia Image Database for Image Processing. To use this dataset you must request access by visiting this page.
This is the second part of a series of articles that will take you through my experience building a custom classifier with Caffe* that should be able to detect Acute Lymphoblastic Leukemia (ALL). I chose Caffe because I enjoyed working with it in a previous project and I like the intuitive way layers are defined in prototxt files; however, my R&D will include replicating both the augmentation script and the classifier in different languages and frameworks to compare results.
Before we begin, we can do some more visualization of the network we created in the previous article. You should have cloned the ALL-Classifiers-2019 repository and run the Setup.sh script in the allCNN project directory root. There have been some updates to the files in this repository, so make sure you have the latest files. We can use the following command to view information about the network:
python3.5 Info.py NetworkInfo
We can save the network using:
python3.5 Info.py Save
We can also do some more visualization: the following command loops through all 30 neurons in the conv1 and conv2 layers and saves the images inside the neurons to disk. Running it writes the output images to the Model/Output directory (conv1 & conv2).
python3.5 Info.py Outputs
Figure 1. conv1 layer neuron output images
Figure 2. conv2 layer neuron output images
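For reference, below is a minimal sketch of the kind of pycaffe loop that saves these neuron outputs; the prototxt path, test image and output file naming are assumptions for illustration, not the exact contents of Info.py.

import cv2
import numpy as np
import caffe

# Load the network definition (path is a hypothetical placeholder)
net = caffe.Net("Model/allCNN.prototxt", caffe.TEST)

# Forward a single 50x50 test image through the network
image = cv2.resize(cv2.imread("Model/Data/Test/Im001_1.jpg"), (50, 50))
net.blobs["data"].data[...] = image.transpose(2, 0, 1)
net.forward()

# Save every neuron's feature map in conv1 and conv2 to Model/Output
for layer in ("conv1", "conv2"):
    for i, fmap in enumerate(net.blobs[layer].data[0]):
        out = cv2.normalize(fmap, None, 0, 255, cv2.NORM_MINMAX)
        cv2.imwrite("Model/Output/%s-%d.jpg" % (layer, i), out.astype(np.uint8))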
The first thing we need to do is sort our training and validation data. In the paper the authors state that they used the full 108 image dataset, ALL_IDB1, with a training set of 80 images and a validation set of 28. First we need to resize the images to 50px x 50px to match the input dimensions of our network; this process is handled by the functions provided in CaffeHelpers.py.
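As a quick illustration of that resize step (the helpers in CaffeHelpers.py apply it across the whole dataset; the file name here is a hypothetical example):

import cv2

# Resize one ALL_IDB1 image to the 50 x 50 input expected by the network
image = cv2.imread("Model/Data/Train/1/Im001_1.jpg")
resized = cv2.resize(image, (50, 50))
cv2.imwrite("Model/Data/Train/1/Im001_1.jpg", resized)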
This article introduces two additional scripts in the allCNN project, Data.py & Classes/CaffeHelpers.py. These files will help us sort our data into training and validation sets, and create the LMDB databases required by Caffe.
LMDB, or Lightning Memory-Mapped Database, is the format Caffe uses to store our training/validation data and labels. In the Classes/CaffeHelpers.py file you will find a few functions that will help you convert your dataset into an LMDB database. Data.py is essentially a wrapper around these functions that does everything needed to create your LMDBs.
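To give a sense of what those functions do internally, below is a minimal sketch of writing a single image into an LMDB database as a Caffe Datum; the database path, map size and key are illustrative assumptions rather than the exact implementation in CaffeHelpers.py.

import cv2
import lmdb
from caffe.proto import caffe_pb2

# Open (or create) the database; map_size is an upper bound in bytes (assumption)
env = lmdb.open("Model/Data/train_lmdb", map_size=int(1e9))

# Wrap a resized image and its label in a Caffe Datum (channels, height, width)
image = cv2.resize(cv2.imread("Model/Data/Train/1/Im001_1.jpg"), (50, 50))
datum = caffe_pb2.Datum()
datum.channels, datum.height, datum.width = 3, 50, 50
datum.data = image.transpose(2, 0, 1).tobytes()
datum.label = 1  # 1 = positive (ALL), 0 = negative

# Commit the record under a fixed-width key so entries are read in order
with env.begin(write=True) as txn:
    txn.put(b"00000000", datum.SerializeToString())
env.close()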
First of all, you need to upload the ALL_IDB1 dataset to the Model/Data/Train/0 and Model/Data/Train/1 directories: take the positive images from ALL_IDB1 (ending in _1.jpg) and add them to Model/Data/Train/1, then do the same for the negative images (ending in _0.jpg), adding them to Model/Data/Train/0.
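If you would rather script this step than copy the files by hand, a simple sketch along these lines would work; the source directory name is an assumption based on wherever you extracted ALL_IDB1.

import os
import shutil

source = "ALL_IDB1/im"  # hypothetical location of the extracted dataset

for filename in os.listdir(source):
    if filename.endswith("_1.jpg"):
        # Positive (ALL) examples
        shutil.copy(os.path.join(source, filename), "Model/Data/Train/1")
    elif filename.endswith("_0.jpg"):
        # Negative examples
        shutil.copy(os.path.join(source, filename), "Model/Data/Train/0")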
The training and validation sets proposed in the paper are as follows:
Figure 3. Training / testing data from paper (Source)
Figure 4. Training/validation data sorting output
In this article we want to replicate the training and validation dataset sizes used in the paper. To ensure we get the correct sizes we use helper classes I wrote, provided in the Classes directory.
If you have the latest code from the repository, you should have the file Classes/CaffeHelpers.py in the allCNN project directory. This Python class handles the Caffe related tasks for our network. The four main functions that we use in CaffeHelpers are recreatePaperData(), createTrainingLMDB(), createValidationLMDB() and computeMean().
recreatePaperData() is the function that replicates the training and validation datasets using the sizes mentioned in the paper.
createTrainingLMDB() is the function that converts our training dataset into an LMDB database.
createValidationLMDB() is the function that converts our validation dataset into an LMDB database.
computeMean() is the function that computes the mean image of the training data, so that it can be subtracted from each image during training, as sketched below.
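For context, the mean file produced by a step like computeMean() is conventionally generated with Caffe's compute_image_mean tool over the training LMDB; the paths below are assumptions, and the repository's helper may wrap this differently.

compute_image_mean -backend=lmdb Model/Data/train_lmdb Model/Data/mean.binaryproto

The resulting mean.binaryproto is then typically referenced from the transform_param of the data layers so the mean is subtracted consistently at train and test time.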
Data.py in the allCNN project directory provides an easy way to run the required functions for sorting our dataset. This file is basically a wrapper around the CaffeHelpers class.
Have a quick look through the source code to familiarize yourself with what is going on, then, assuming you are in the allCNN project root, use the following command to run the data sorting process:
python3.5 Data.py
As shown in Figure 4, we have now created training and validation datasets that match the ones used in the paper. In the next article we will train the convolutional neural network using this dataset.
Introduction to convolutional neural networks in Caffe* - Adam Milton-Barker
Preparing the Acute Lymphoblastic Leukemia dataset - Adam Milton-Barker
The Peter Moss Acute Myeloid & Lymphoblastic Leukemia AI Research project encourages and welcomes code contributions, bug fixes and enhancements from the GitHub community.
Please read the CONTRIBUTING document for a full guide to forking our repositories and submitting your pull requests. You will also find information about our code of conduct on this page.
We use SemVer for versioning.
This project is licensed under the MIT License - see the LICENSE file for details.
We use the repo issues to track bugs and general requests related to using this project.