\section{Methods}

The purpose of this chapter is twofold: it documents the design and processing decisions made during the development of this project, and it gives insight into how the applied technologies work. The sections on the setup and the data set describe the working environment of this study; the sections on preprocessing, architecture, training, and postprocessing then follow the development in chronological order.

\subsection{Setup}

The workstation included an Intel i5 processor, 16GB of RAM and, most importantly, an NVIDIA GeForce GTX 1060 with 6GB of video memory. Neural networks can be trained more efficiently on GPUs than on CPUs, because the simpler but highly parallelized architecture of graphics chips suits the calculations required in deep learning \cite{NVIDIA}.

The computer ran Ubuntu 16.04 with the NVIDIA CUDA and cuDNN libraries installed to take advantage of the GPU. Python 2 was chosen as the primary programming language because of its simple syntax and its popularity in the deep learning field; the code was briefly tested with Python 3 as well and appeared to work. Keras was used as the framework for training models because its high-level syntax allows fast prototyping and testing. The development environment was a Jupyter Notebook, which allowed flexible execution of individual code snippets instead of running the entire program for every single change.

Processing medical image data required a library that can handle the corresponding file formats. SimpleITK is a Python binding to the ITK library, which is written in C++; it includes many tools for image processing and is especially popular in the medical field. Other libraries were used for smaller tasks.
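
As an illustration, the listing below sketches how a volume can be read with SimpleITK and converted to a NumPy array; the file name is only a placeholder.

\begin{verbatim}
# Minimal sketch: reading an MHD volume with SimpleITK.
import SimpleITK as sitk

image = sitk.ReadImage('sample_knee.mhd')     # 3D, T1-weighted MR volume
array = sitk.GetArrayFromImage(image)         # NumPy array, axis order (z, y, x)
print(image.GetSize(), image.GetSpacing())    # voxel counts and physical spacing
\end{verbatim}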

\subsection{Data Set}

The data set is a collection of three-dimensional, T1-weighted, non-fat-saturated MR images of the right knee. The number of available samples grew during the project; for most of the development time it included 150 samples from three different sources. All images were provided in the MHD file format, which is very common in the medical field and features lossless compression as well as technical meta information about each image. The resolution of the samples differed between sources.

\begin{table}[h!]
\centering
\begin{tabular}{ | l | l | l | r | r | l |}
  \hline
  Source & Prospective & View & Samples & Seg. maps & Resolution \\
  \Xhline{3\arrayrulewidth}
  Class 1 & Yes        & Coronal     & 65      & 36            & 800x800x41 \\
  \hline
  Class 2 & Yes        & Coronal     & 80      & 40            & 512x512x24 \\
  \hline
  Class 3 & No         & Sagittal    & 5       & 0             & multiple \\
  \hline
\end{tabular}
\caption{Details of the data set separated by source}
\end{table}

The class 3 data featured a sagittal perspective, while all other samples were taken in a coronal view. This mattered because the resolution of all images was roughly 20 times lower along the z-axis. The 80 class 2 images originated from a previous study led by Jopp et al. that investigated the condition of growth plates. The 65 class 1 images were taken as part of the DFG project and showed the highest resolution because they were recorded on a different MRI machine.

\begin{figure}[H]
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/3.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/10.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
  \includegraphics[width=\linewidth]{imgs/17.jpg}
\endminipage
\vspace{0.2cm}
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/24.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/31.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
  \includegraphics[width=\linewidth]{imgs/38.jpg}
\endminipage
\caption{Six different slices of one 3D MRI}
\end{figure}

The images above move through the slices of a single 3D MRI. The Femur is already visible in the first frame, whereas the Tibia starts to show in the second. The Fibula appears in the lower left corner of the fourth image and can be seen more clearly in the fifth.

A total of 76 segmentation maps were initially labeled by semi-automatic region growing followed by manual adaptation. The labeling process took two hours per 3D image, and each map labeled Femur, Tibia, and Fibula separately. Not the entire image was segmented, but only a square window around the point where Femur and Tibia meet.

\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/age_distr.png}
\caption{Age distribution in the data set}
\par
\end{figure}

The candidates chosen for the MRI recordings were all German males between the ages of 14 and 21. Their ages were normally distributed, and a majority of the candidates were recorded twice, roughly a year apart.

\subsection{Preprocessing}

The number of parameters in a neural network commonly ranges from tens of thousands to hundreds of millions. This complexity allows the model to learn on its own which features of an image are relevant for a given task, but it relies on large volumes of data being available for training. Because of the small data set available for this study, several types of preprocessing were applied to the images. For the most part, these techniques remove irrelevant information and reduce the variance between samples. Other preparation steps explored the difference between 2D and 3D data as well as the influence of separate segmentation channels on the output.

\subsubsection{Training, Validation, and Testing Sets}

The data was split into three subsets, each used for a different purpose: training, validation, and testing. The training set is commonly the largest of the three and contains the data used for the actual learning process. It is the only portion of the data from which the network draws direct conclusions.

The validation data is used to regularly measure the performance of the model and to check whether overfitting occurs. If the accuracy on the validation set drops below the results on the training data, the network is starting to memorize the samples it knows rather than generalizing from them.

The third subset is referred to as the testing data. In contrast to the validation set, it is used only once at the very end to give a final score. The idea is that tuning a model based on the validation results causes a certain amount of information bleed, through which the network implicitly learns from the validation data. To prevent biased results, the testing data is kept back as a final performance reference.

\begin{itemize}
\item Training Set: 74\% of the data
\item Validation Set: 13\% of the data
\item Test Set: 13\% of the data
\end{itemize}
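
The following sketch illustrates one way such a random split can be produced; the ratios follow the list above, while the index handling is only an assumption about the actual implementation.

\begin{verbatim}
# Illustrative random split of the labeled samples into the three subsets.
import numpy as np

n_samples = 76                                   # labeled volumes available here
indices = np.random.permutation(n_samples)
n_train = int(round(0.74 * n_samples))
n_val = int(round(0.13 * n_samples))

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
\end{verbatim}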

\subsubsection{Cropping, Resizing, and Resampling}

The original framing left large areas of the upper and lower leg visible in the picture. Since these were not considered relevant, they were cropped out. An existing algorithm was used to detect the center point where Tibia and Femur meet, and only a square window around this point was kept.

Although there is no theoretical limit to the input size of convolutional neural networks, it is desirable to reduce the spatial resolution and thereby decrease the number of calculations. A size of 224 x 224 pixels per slice still gave enough detail to identify the shapes of Femur, Tibia, and Fibula, while also being a common input resolution for CNNs.

Resizing along the z-axis was problematic because its resolution was roughly 20 times lower than that of the other two dimensions. When scaling along this dimension, the segmentation maps of neighboring slices started to blend into merged maps of multiple layers. To bring the images from the two main sources onto the same scale, two different approaches were tried.

In the first approach, the 41 slices per image of the class 1 data were padded with empty slices to 48. Afterward, every second slice was kept, resulting in the same 24 slices the data from Jopp et al. showed. Doing it the other way around and upscaling 24 slices to 48 was not an option, because the interpolated segmentation maps might have falsified the data.

Throwing away perfectly good data is very uncommon in machine learning, especially when data sets are small. However, a slice shares a lot of information with its neighbors and can be seen as a kind of augmented version of them. By using the same spatial resolution for both sources, the balance between the amounts of class 1 and class 2 data was kept. A total of 76 images with a resolution of 24 x 224 x 224 voxels was now available for training.
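
A sketch of this slice-axis harmonization is shown below; the symmetric padding split and the axis order (slices, rows, columns) are assumptions.

\begin{verbatim}
# Sketch: pad a 41-slice class 1 volume to 48 slices, then keep every second slice.
import numpy as np

volume = np.zeros((41, 224, 224), dtype=np.float32)   # placeholder class 1 volume
padded = np.pad(volume, ((3, 4), (0, 0), (0, 0)), mode='constant')  # 41 -> 48
reduced = padded[::2]                                  # 48 -> 24 slices
assert reduced.shape == (24, 224, 224)
\end{verbatim}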

The second approach converted the volumetric data to 2D, using all 41 slices of each class 1 image and all 24 slices of each class 2 image. No data was thrown away, but the samples were no longer volumetric.

\subsubsection{Normalization}

The normalization of images refers to the process of transforming all samples onto the same scale of values. Two techniques are popular in the field of deep learning. The first one is called feature scaling, where every sample is rescaled to the range between 0 and 1.

%\begin{figure}[H]
\[x' = \frac{x - \min(x)}{\max(x) - \min(x)}\]
%\end{figure}

The second technique computes the standard score, where the mean is subtracted from each sample and the result is divided by the standard deviation.

%\begin{figure}[H]
\[x' = \frac{x - \mu}{\sigma}\]
%\end{figure}

Usually, the mean and standard deviation are not calculated per image but over all training samples. This centers the intensities around the average brightness of the training data. When normalizing validation and test data, the mean and standard deviation of the training set are used as well, because those are the values the network expects.

The standard score is the more desirable approach because its mean centering helps to prevent parameters from growing abnormally large or being driven to zero during training. It is also less sensitive to outliers, e.g., extremely bright spots, which frequently appeared in the images in the form of very high intensities.

In contrast to the regular use of the standard score, every image was normalized with its own mean and standard deviation instead of the statistics of the entire training set. The class 2 data featured roughly five times higher intensities than the class 1 data because the two were recorded on different MRI machines. Figure 9 shows the two resulting spikes in mean intensity that had to be dealt with. By normalizing each 3D image with its own values, these two distributions were unified. Furthermore, this procedure normalizes future samples onto the scale the network expects, regardless of the bit depth of an image.
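
A minimal sketch of this per-image standard score is given below; the function name is only illustrative.

\begin{verbatim}
# Per-volume standard score instead of data-set-wide statistics.
import numpy as np

def normalize_volume(volume):
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / (volume.std() + 1e-8)  # epsilon avoids /0
\end{verbatim}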

\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/intensity_distr.png}
\caption{Intensity distribution in the data set}
\par
\end{figure}

This approach also has a downside: the network can no longer use the mean intensity or the standard deviation as features to learn from, because every image is identical with respect to these measures. However, these two features showed no correlation with the accuracy of a prediction, which is why this approach was chosen.

\subsubsection{N4 Bias Field Correction}

Due to the inhomogeneity of the magnetic field in MRI machines, a bias field is present in the acquired images as a low-frequency, non-uniform intensity. Several correction methods have been developed, of which N4ITK is the de facto standard in the field \cite{Tustison2010}. As the name suggests, it is included in the Insight Toolkit (ITK), for which SimpleITK provides a Python binding. Running the N4 bias field correction with its default settings on a single 24 x 224 x 224 image took between 60 and 120 seconds.
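
The correction was applied through the SimpleITK interface; the sketch below shows a minimal call with the default filter settings, where the Otsu foreground mask is an assumption rather than the exact procedure used here.

\begin{verbatim}
# Sketch: N4 bias field correction through SimpleITK (default filter settings).
import SimpleITK as sitk

image = sitk.ReadImage('sample_knee.mhd', sitk.sitkFloat32)
mask = sitk.OtsuThreshold(image, 0, 1, 200)       # rough foreground mask (assumed)
corrector = sitk.N4BiasFieldCorrectionImageFilter()
corrected = corrector.Execute(image, mask)
\end{verbatim}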

\begin{figure}[H]
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/orig.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/corr.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
  \includegraphics[width=\linewidth]{imgs/diff2.jpg}
\endminipage
\caption{Raw image, corrected image and intensity differences}
\end{figure}

\subsubsection{Augmentation}

Image augmentation is a popular approach to virtually increase the size of the data set. The general idea is that a neural network will overfit more when learning one sample $n$ times than when learning $n$ variations of this sample once each. Augmentation therefore helps the network to generalize to new images instead of memorizing patterns in the training data.

Lossless augmentation techniques are those that do not change the values or the relative neighborhoods within a sample. In 2D these include vertical and horizontal flips, as well as diagonal flips if width and height are equal. Any multiple of a 90-degree rotation can also be created by combining these flips. Although the samples are changed as a whole, the values they contain remain untouched.

The term ``lossless'' refers only to the technical change and not necessarily to the semantic one. Vertically flipping a slice of this data set, for example, swaps the absolute positions of Femur and Tibia, which might make it harder for the network to distinguish between the two.

Lossy augmentation methods cover a variety of image transformations. Common choices include:

\begin{itemize}
\item Horizontal shifts
\item Vertical shifts
\item Rotations
\item Shear mapping
\item Brightness adjustments
\item Contrast adjustments
\item Gamma adjustments
\end{itemize}

For this study, horizontal flips were implemented to give the impression that images of both knees were available. In addition, the images were shifted by 24 pixels along the horizontal axis to add another type of augmentation.

Interestingly, these methods did not improve the accuracy of the model. The horizontal flips even hurt the performance and were therefore removed. The horizontal shifts were kept in the training set so that the network would perform better on unseen data that is not perfectly aligned.
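
Both augmentations amount to simple array operations; a rough sketch is shown below, with the axis order (slices, rows, columns) assumed.

\begin{verbatim}
# Sketch of the two augmentations that were tried on image and mask pairs.
import numpy as np

def horizontal_flip(volume, mask):
    return volume[:, :, ::-1], mask[:, :, ::-1]

def horizontal_shift(volume, mask, pixels=24):
    v = np.roll(volume, pixels, axis=2)
    m = np.roll(mask, pixels, axis=2)
    v[:, :, :pixels] = 0          # do not wrap image content around the border
    m[:, :, :pixels] = 0
    return v, m
\end{verbatim}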

\subsubsection{2D and 3D Data}

Although convolutional neural networks are most commonly associated with 2D data such as photos or drawings, they can be applied in any number of dimensions. In natural language processing (NLP), 1D convolutions are often used on sentences, and in finance they can be applied to time series forecasting. In the medical field, where a variety of volumetric data exists, 3D convolutions have become a popular choice for building deep learning solutions.

In the context of neural networks, one has to differentiate between volumetric data with one channel and 2D data consisting of multiple color channels. Both are examples of three-dimensional arrays, but the volumetric images feature three spatial dimensions, whereas photos feature only two. Convolutions commonly traverse the spatial dimensions, which means that photos with multiple channels are usually not subject to 3D convolutions; instead, the input channels are fed into a 2D CNN.

Although the data set consisted of volumetric MRI data, the z-axis contained 20 times fewer voxels than the other two dimensions. Since the influence of this imbalance was unknown, both 3D and 2D architectures were investigated for this project. The 3D model did not apply MaxPooling along the already small z-axis and used a kernel size of 3 x 3 x 3, similar to the 3D version of U-Net \cite{Cicek2016}. The 2D data was created by treating each slice as a separate input image.

As described in 3.3.2, the 3D convolutional network used 24 x 224 x 224 voxel images from both the class 1 and class 2 data. In contrast, the 2D network used all of the available slices from each source as separate images.

While in theory the z-axis gave the 3D network more contextual information about every slice, the 2D model performed better in the end. The latter was also less computationally expensive and allowed working with data sets in which only single slices are available per sample.
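
In practice, the difference mainly shows in the input shapes and kernel sizes; the following Keras sketch illustrates this for a single slice and a single-channel volume (channels-last ordering is assumed).

\begin{verbatim}
# Sketch: 2D slices with one channel vs. a single-channel volume in Keras.
from keras.layers import Input, Conv2D, Conv3D

slice_input = Input(shape=(224, 224, 1))        # rows, columns, channels
volume_input = Input(shape=(24, 224, 224, 1))   # slices, rows, columns, channels

x2d = Conv2D(32, (3, 3), padding='same')(slice_input)      # moves over rows/columns
x3d = Conv3D(32, (3, 3, 3), padding='same')(volume_input)  # also moves over slices
\end{verbatim}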

\subsubsection{Separate Bone Maps}

The initial segmentation maps came with three separate values for Femur, Tibia, and Fibula. With this information, it was possible to train a network that segments the three bones while still differentiating between them.

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/x1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/x2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/x3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/x4.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/y1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/y2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/y3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/y4.png}
\endminipage
\caption{Examples of slices with their ground truth segmentations}
\end{figure}

However, this led to problems with the Fibula, which was not recognized by any of the applied architectures; its channel was always predicted empty. The Fibula accounted for only 3.4\% of the segmented area, whereas the Tibia made up 41\% and the Femur the remaining 55.6\%. The network therefore learned that an empty prediction of this channel is correct in 96.6\% of the cases, which is very accurate on its own. Instead of spending capacity on a channel that makes up only a tiny fraction of the result, the network improved its performance on Femur and Tibia.

The only successful method was to train a separate model for each of the three bones. Such a network also started by predicting empty segmentation maps, because the actual masks make up only a fraction of the frame. The only way to improve from there, however, was to find the Fibula in the non-empty slices and segment its region. This differed from the previous tests, where the unsatisfactory performance on the Fibula could be compensated by more accurate segmentations of the other two bones.

In another experiment, the three channels were merged to create a network that segments any long bone in the image. Not only did this solve the problem with the Fibula, but the predictions for Femur and Tibia remained just as accurate. The network no longer needed to learn that the global position in the image determines whether or not an object should be segmented; instead, it could segment any long bone at any position in the frame.

While building the architecture described in the following sections, this merged-channel approach was used. It helped to speed up the design process because each setting only needed to be trained once instead of once per bone.
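
Merging the channels amounts to a single thresholding step; the sketch below assumes a label volume in which the three bones are encoded as distinct non-zero values.

\begin{verbatim}
# Sketch: collapse the separate bone labels into one "any long bone" mask.
import numpy as np

def merge_bone_labels(label_volume):
    return (label_volume > 0).astype(np.float32)   # 1 where any bone is labeled
\end{verbatim}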

\subsection{Architecture}

The majority of classification CNNs use a repeating combination of convolutional and pooling layers \cite{Krizhevsky, He2015b, Iandola2016a, Ronneberger2015a}. By combining the local pattern learning of convolutions with the spatial reduction of pooling, the input is compressed into a dense representation of its most important features.

As mentioned in 2.2.5, a segmentation can be seen as a classification of every pixel. Accordingly, early CNN segmentation models used small patches of the input image, each predicting only a single pixel through a classification pipeline; a full segmentation map was then assembled from these pixel predictions. This process was very slow, and it also prevented the network from having a field of view larger than the inserted patch.

An improvement for segmentation came with architectures referred to as encoder-decoder models. The encoding step performs the same spatial compression used in classification networks. Afterward, in the decoding step, the spatial resolution is brought back to its original shape and further processed by additional convolutions. This yields huge speed improvements while also increasing the accuracy of the prediction. Encoder-decoder networks are the de facto standard in the current field of image segmentation \cite{Chen2017, Zhao2016, Lin2016, Chen2016, Nekrasov2016, Milletari2016, Cicek2016, Ronneberger2015a, Badrinarayanan2015}, and they were the starting point for building the architecture in this study.

\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/unet.png}
\caption{Visualization of the U-Net architecture}
\par
\end{figure}

Figure 12 shows a particular encoder-decoder model known as U-Net \cite{Ronneberger2015a}. It has been used successfully in a variety of projects inside and outside the medical field. The contracting side on the left resembles the architecture of classification CNNs, and the process is then mirrored on the expanding side.

The following subsections discuss key elements of the design process for such an architecture. They also analyze how techniques introduced after the U-Net paper could improve the performance on this data set.

\subsubsection{Channels, Growth Rate, and Depth}

The number of parameters in a neural network correlates strongly with its learning capacity. By adding more nodes that can be adjusted during training, the model can approximate a more complex function from input to output. The downside is that a larger parameter count also increases the risk of overfitting the data. A convention in the field of CNNs is to gradually increase the number of channels while the spatial resolution is reduced through MaxPooling \cite{Krizhevsky, He2015b, Iandola2016a, Ronneberger2015a}. U-Net also shows this behavior on the contracting side of its architecture.

A test was set up that compared the original U-Net against five smaller variants. They varied in the total number of parameters and in how the number of channels changed from one convolutional block to the next.

\begin{table}[H]
    \centering
    \begin{tabular}{| l | r | r | r | r | r | r | r | r | r | r |}
    \hline
    Name    & C1   & C2   & C3   & C4   & C5   & C6   & C7   & C8   & C9  & Params \\
    \Xhline{3\arrayrulewidth}
    Model A &   32 &   32 &   32 &   32 &   32 &   32 &   32 &   32 &  32 & 195k      \\
    \hline
    Model B &   16 &   24 &   36 &   54 &   81 &   54 &   36 &   24 &  16 & 332k      \\
    \hline
    Model C &   48 &   48 &   48 &   48 &   48 &   48 &   48 &   48 &  48 & 437k      \\
    \hline
    Model D &    8 &   16 &   32 &   64 &  128 &   64 &   32 &   16 &   8 & 491k      \\
    \hline
    Model E &   32 &   48 &   72 &  108 &  159 &  108 &   72 &   48 &  32 & 1.33m      \\
    \hline
    U-Net   &   64 &  128 &  256 &  512 & 1024 &  512 &  256 &  128 &  64 & 31.0m      \\
    \hline
    \end{tabular}
    \caption{Output channels of the convolutional blocks C1 to C9 and total parameter count of each model}
\end{table}

Training U-Net was slow in two regards: each training step took six times longer than with the smallest model, and it also took 25 times more training steps to reach the same results. Since it was also overfitting significantly, it was not trained to completion. The second largest model, E, also overfitted the training data and never achieved results comparable to the other four models.

Model D started with eight channels, which were then increased by a factor of 2 per block, whereas B started larger but increased the channel count only by a factor of 1.5. Although it is the smaller model, B showed the higher accuracy in the end. It was concluded that the initial number of output channels has a larger effect on the result than the growth rate.

This conclusion was supported by the results of A and C, both of which kept their initial number of channels throughout the network. These two showed the highest scores of all compared models while maintaining a relatively small size. Since their results were identical, model A was kept because of its faster training speed. It was interesting to see that the smallest model performed best among the candidates.

The original U-Net features a depth of five convolutional blocks before reaching the bottom of the ``U'' shape. Further experiments showed that a depth of 6 did not improve performance, while a depth of 4 decreased the accuracy. Model A was therefore left unchanged.

\subsubsection{Skip Connections}

U-Net uses skip connections to copy values from the contracting side to the corresponding layer on the expanding side. This improves performance because spatial information would otherwise be lost while contracting the input. An experiment confirmed that removing these paths hurt the performance badly.

ResNet \cite{He2015b} is a popular classification architecture that was the first to use residual connections, another type of skip connection that bridges the beginning and end of a convolutional block. According to the original paper, they are one of the main reasons CNNs can be expanded to depths of thousands of layers. Since they have been used with great success in many models, they were also tried in this network. However, they had no impact on the performance or the training speed, which is why they were not used in the end.

\subsubsection{Dropout}

Dropout is a popular regularization technique that randomly zeroes out a fraction of the nodes during training \cite{Srivastava2014}. This helps the model to generalize better and reduces overfitting on a given data set. Well-known image classification architectures like VGG16 \cite{Simonyan2014a}, SqueezeNet \cite{Iandola2016a}, or AlexNet \cite{Krizhevsky} use Dropout near the end of the network. Similarly, U-Net applies Dropout at the end of the contracting path to prevent overfitting.

Since a single Dropout layer with a rate of 0.5 is common in other architectures, this was chosen as the first option. Other tests added multiple Dropout layers between the convolutions of the contracting side, which led to slower training and lower scores. Adding Dropout on the expanding side is rather unusual and also did not perform well in the tests. In the end, the first candidate was kept for all further training.

\subsubsection{Activation Functions}

Activation functions sit between the layers of a network to introduce non-linearity. Without them, the stacked operations could be reduced to a single linear transformation, which would remove the benefit of building a deep model. The rectified linear unit, or ReLU \cite{Nair}, is the default recommendation for an activation function in modern neural networks. It is defined as the maximum of 0 and the input value and can be described as a nonlinear function made up of two linear pieces. Because of this, it preserves many of the properties that make models easy to optimize and help them generalize well to new data \cite{Goodfellow2016}.

\[
\text{relu}(x) =
\begin{cases}
0 & \text{if } x < 0  \\
x & \text{if } x \geq 0
\end{cases}
\]

\[
\text{elu}(x) =
\begin{cases}
e^x - 1 & \text{if } x < 0  \\
x & \text{if } x \geq 0
\end{cases}
\]

Since the introduction of ReLU, other variants have built on its success, such as PReLU \cite{He2015a}, LReLU, and ELU \cite{Clevert2015}. The latter, the exponential linear unit, shows the same linear behavior for positive values but applies an exponential function to negative inputs instead of erasing them. This has been shown to yield further improvements by keeping the mean of the activations centered around 0, while being only slightly more computationally expensive. ELU showed superior results compared to ReLU in the experiments here and was chosen for all further tests.

\begin{figure}[H]
\centering
\par
\includegraphics[width=0.53\textwidth]{imgs/elu_relu5.png}
\caption{Visualization of ReLU and ELU}
\par
\end{figure}

\subsection{Training}

The training is where the actual learning takes place, and its process is inspired by how humans improve their performance on tasks. When confronted with a difficult question, the first step would be a mere guess of what the answer might be. In a second phase, this guess is compared to the right answer, and the brain processes the difference accordingly.

Similarly, the training of a neural network consists of two parts. In the first step, the input is fed forward through the network, and a prediction is made. At the very beginning this is just a random guess, because ANNs are initialized with random values.

\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/forward_backward_prop.png}
\caption{The iterative process of forward and backward propagation}
\par
\end{figure}

The prediction is then compared to the correct answer, and the error is calculated. This error can be fed back through the network to optimize the parameters of the model. These two steps are called forward and backward propagation.

\subsubsection{Metrics and Loss Functions}

Metrics are used in deep learning to measure the performance of a model. For example, the accuracy is often chosen to describe how well a neural network is doing on a classification task. An accuracy of 0.9 indicates that 9 out of 10 samples are classified correctly.

\[
\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}
\]

In the formula above, T and F indicate whether a prediction was correct (true) or incorrect (false), while P and N stand for a positive or negative prediction.

The result of a loss function is the quantity that is minimized during backward propagation. It needs to be a differentiable function, which is why the accuracy cannot be used as a loss: it is a binary metric that works with true or false decisions rather than with probabilities.

In situations like these, a surrogate function is used that correlates strongly with the target metric. For classification problems this is often the cross entropy. Because a segmentation can be seen as a classification of every output pixel, it was also chosen as a candidate for this study.

\[
\text{Cross Entropy} \ (p, q) = -\sum_x \ p(x) \log q(x)
\]

Another option was the $F_1$ score, which is the special case of the F-measure with $\beta = 1$. It is also known as the Dice similarity coefficient (DSC).

\[
F_\beta \ \text{Score}= \frac{(1 + \beta^2) TP}{(1+\beta^2)TP+\beta^2FN+FP}
\]

Although the $F_1$ score is commonly applied as a binary measure and is therefore not differentiable, a ``soft'' version can be used that works with continuous probabilities. To use it as a loss function, where 0 describes the best possible outcome, the $F_1$ loss was defined as $1 - F_1 \ \text{Score}$.

\[
F_\beta \ \text{Loss} = 1 - F_\beta \ \text{Score}
\]

The value of $\beta$ can be adjusted to shift the emphasis between precision and recall. Precision describes how much of the predicted area is correct, whereas recall describes how much of the ground truth was recognized by the model. This can be useful for data sets with large imbalances between classes, such as the imbalance between Fibula, Femur, and Tibia in this study.
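
A soft $F_1$ loss of this kind can be written directly with the Keras backend; the sketch below is one possible formulation with a small smoothing constant and is not necessarily identical to the implementation used here.

\begin{verbatim}
# Sketch: a "soft" F1 (Dice) loss computed on continuous probabilities.
from keras import backend as K

def f1_loss(y_true, y_pred, smooth=1.0):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)       # soft true positives
    f1 = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - f1
\end{verbatim}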

The $F_2$ score was tried for the three-channel segmentation, in which the Fibula could not be recognized. Although this model showed better scores in terms of recall, the Fibula was still not segmented.

\[
\text{Precision} = \frac{TP}{TP+FP}
\]

\[
\text{Recall} = \frac{TP}{TP+FN}
\]

\[
\text{Total Error} = \frac{FP+FN}{TP+TN+FP+FN}
\]

Both loss functions were used in separate runs to compare the $F_1$ score and the cross entropy. The $F_1$-trained model showed better results on the $F_1$ score, and the cross-entropy model showed better results on the cross-entropy loss. With regard to precision, recall, accuracy, and Intersection over Union (IoU), the $F_1$-trained network achieved the higher scores.

\[
\text{IoU}= \frac{TP}{TP+FP+FN}
\]

Another test run used both loss functions at the same time. This was done by feeding the output of the next-to-last layer into two separate segmentation outputs, one trained with the $F_1$ loss and the other with the cross-entropy loss. All previous layers were therefore trained with the sum of both loss functions. This architecture matched the best results of the previous runs in a single model but also increased the training time by 30\%.

Evaluating these two outputs on the other metrics favored the $F_1$ loss output, and combining both maps did not improve the scores any further. Since the cross entropy had only been chosen as a surrogate and not as a performance metric, all further tests used the $F_1$ loss alone, which also trained faster than the dual-output method.

\subsubsection{Optimizer}

The previous section provided an overview of the loss function, which measures the performance of a prediction during training. This section discusses how its result is used to execute the actual optimization step.

The derivative of a single-variable function gives the slope of that function at any given point. Knowing the slope, one can tell in which direction the function decreases. Repeatedly adjusting the independent variable along the descent of a loss function therefore minimizes the error it measures.

The gradient generalizes the derivative to functions with multi-dimensional inputs, such as the loss functions used in deep learning. The process of iteratively minimizing a function by following its gradient downhill is called gradient descent. While it is in principle possible to determine a minimum analytically, this is intractable for artificial neural networks due to the high number of parameters \cite{Chollet2017}.

Instead, stochastic gradient descent (SGD) is used, which takes a random batch of the training data and iteratively adjusts the parameters in small steps. SGD is the basis for training most common ANNs, but over the years different variants have been introduced to the field.

The Adam optimizer \cite{Kingma2014} is one such variant, which enhances SGD among other things through what is called ``adaptive moment estimation,'' from which its name derives. It adaptively adjusts the learning rate, which defines how much the parameters are changed in one training step. By incorporating the previous and current shape of the slope, Adam can speed up training tremendously.

\subsubsection{Batch Size}

The number of random samples per training step is referred to as the batch size. It was long believed that larger batches lead to the so-called generalization gap \cite{Keskar2016}, where the accuracy of a model drops when it is trained with unusually large batches. Recent work \cite{Hoffer2017} suggests other reasons for this decrease in accuracy. While common batch sizes range from 32 to 256, Goyal et al. showed accurate results using 8192 images per batch when training a model on ImageNet \cite{Goyal2017}.

Using batches smaller than 32 samples can introduce a different kind of problem: too few data points may not represent the mean of the data well, which can lead to slow and unstable training.

Due to hardware limitations, the largest possible batch sizes ranged from 24 to 48 samples, depending on the current architecture; whatever fit into memory was used for these experiments. The exception was working with 3D convolutions, where the batch size had to be limited to just 4 samples.

\subsubsection{Learning Rate Policy}

One full iteration over the training samples is referred to as an epoch, and the learning rate policy describes how the learning rate is changed from one epoch to the next. With the introduction of adaptive optimizers like Adam, there has been less emphasis on this topic because the learning rate is already adjusted during training. Even though this reduces the number of possible pitfalls, training time can often be saved with the right initial learning rate.

\newpage

Ten epochs were run at different learning rates to compare initial results and to find the point at which the model would not converge at all. A rate of 0.002 was the highest at which the model still started training, but 0.001 resulted in the best score. Adding a decay that manually reduces the learning rate over time did not improve the results when Adam was used.
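
In Keras, the chosen rate is simply passed to the optimizer when compiling the model; the following one-line sketch assumes the model and the $F_1$ loss defined earlier.

\begin{verbatim}
# Sketch: compiling with Adam and the initial learning rate found above.
from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.001), loss=f1_loss)
\end{verbatim}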

\subsubsection{Early Stopping}

A neural network will continuously minimize the loss on the training set, but this result needs to be validated on data the network has not seen before. At a certain point during training, the performance on the validation set starts to decrease because the model is overfitting the training data. The number of iterations needed to reach this point depends on many hyperparameters as well as on the random values the network was initialized with. It is therefore difficult to predict how long the training will need to reach its peak.

Early stopping is a simple technique that ends the training process as soon as the model stops improving on the validation data. To do this, a patience value defines how many epochs the network should continue training after the score has stopped improving. This is important because not every epoch leads to a new best score on the validation data.

For the test runs in this study, a patience of 9 was selected, which meant the training would stop after 10 epochs without improvement. Depending on the architecture and the other hyperparameters, this point was reached after 30 to 60 epochs. For the last training with the final set of hyperparameters, the patience was increased to 19, which did not improve the accuracy. This also verified that the initial value of 9 was a good fit for this problem.
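
In Keras, this behavior is available as a callback; the minimal sketch below uses the patience from above, while the monitored quantity, data variables, and epoch limit are assumptions.

\begin{verbatim}
# Sketch: stop training once the validation loss has not improved for several epochs.
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=9)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stopping])
\end{verbatim}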

\subsection{Postprocessing}

After the training had finished, it was investigated how the results on the validation data could be improved further. Since the predicted segmentations contained a small number of values between 0 and 1, a threshold was applied to create a fully binary map; a value of 0.3 gave the best results. Furthermore, any 2D slice whose predicted area was smaller than 2\% of the total frame was set to empty instead. This helped against falsely predicted tiny fragments in the upper and lower slices. For the segmentation of the Fibula alone, this last step was not applied.
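
A sketch of these two steps is shown below; the axis order (slices, rows, columns) and the function name are assumptions.

\begin{verbatim}
# Sketch: binarize the prediction and empty almost-empty slices.
import numpy as np

def postprocess(prediction, threshold=0.3, min_fraction=0.02):
    binary = (prediction > threshold).astype(np.float32)
    for i in range(binary.shape[0]):            # iterate over the slices
        if binary[i].mean() < min_fraction:     # predicted area below 2% of the frame
            binary[i] = 0.0
    return binary
\end{verbatim}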

\subsection{Summary}

In summary, the architecture features four ``Down Blocks'' on the contracting side that are mirrored by four ``Up Blocks'' on the expanding side. Each level is connected to the opposite side to share information at the current spatial resolution.

\begin{figure}[H]
\centering
\par
\includegraphics[width=0.7\textwidth]{imgs/model.png}
\caption{The proposed architecture for the segmentation}
\par
\end{figure}

A Down Block consists of two convolutional layers, each followed by an ELU activation, after which a MaxPooling layer halves the width and height of the image. An Up Block looks similar but starts with a transposed convolutional layer to upscale the resolution and has no MaxPooling at the end. The Middle Block features a Dropout layer with a rate of 0.5, which is likewise surrounded by convolutional layers. All convolutional layers have 32 channels. The output layer is another convolution that reduces the channels to 1, representing the segmentation map. A sketch of this structure is given below.
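
The sketch is a simplified Keras reconstruction from the description above rather than the exact training code; the padding, the transposed-convolution kernel size, and the 1 x 1 output kernel are assumptions.

\begin{verbatim}
# Sketch: 4 Down Blocks, a Middle Block with Dropout, 4 Up Blocks, 32 channels each.
from keras.models import Model
from keras.layers import (Input, Conv2D, Conv2DTranspose, MaxPooling2D,
                          Dropout, concatenate)

def down_block(x):
    x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)
    x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)
    return x, MaxPooling2D((2, 2))(x)          # skip tensor and pooled tensor

def up_block(x, skip):
    x = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same')(x)
    x = concatenate([x, skip])                 # skip connection from the left side
    x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)
    x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)
    return x

inputs = Input(shape=(224, 224, 1))
skips, x = [], inputs
for _ in range(4):                             # contracting side
    skip, x = down_block(x)
    skips.append(skip)

x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)   # Middle Block
x = Dropout(0.5)(x)
x = Conv2D(32, (3, 3), activation='elu', padding='same')(x)

for skip in reversed(skips):                   # expanding side
    x = up_block(x, skip)

outputs = Conv2D(1, (1, 1), activation='sigmoid')(x)          # segmentation map
model = Model(inputs=inputs, outputs=outputs)
\end{verbatim}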

\newpage

Other options that were tried but did not improve performance included using convolutions with a stride of 2 instead of MaxPooling to downsample the images. Batch normalization was also applied but resulted in a high amount of overfitting from which the network never recovered. That it did not help came as a surprise, because it usually improves a model. A possible reason could be the small batch size in combination with the high variance of intensities in the data: if the mean of a batch differs significantly from the mean of the entire training data, batch normalization may hurt performance.

\newpage