\section{Methods}
The purpose of this chapter is twofold. It discusses the design and processing decisions made during the development of this project, while also giving critical insight into how the applied technologies work. The sections on setup and data set describe the working environment for this study. Preprocessing, architecture, training, and postprocessing then show in chronological order how the development was carried out.
\subsection{Setup}
The workstation included an Intel i5 processor, 16 GB of RAM and, most importantly, an NVIDIA GeForce GTX 1060 with 6 GB of VRAM. Neural networks can be trained more efficiently on GPUs than on CPUs, because the simpler but highly parallelized architecture of graphics chips suits the calculations needed in deep learning \cite{NVIDIA}.
The computer ran Ubuntu 16.04 with the NVIDIA CUDA and cuDNN libraries installed to take advantage of the GPU. Python 2 was chosen as the primary programming language due to its simple syntax and popularity in the deep learning field; the code was briefly tested with Python 3 as well and appeared to work. Keras was used as the framework for training models because its high-level syntax allows fast prototyping and testing. The development environment was a Jupyter Notebook, which allowed flexible execution of code snippets instead of running the entire program for every single change.
Processing medical image data required a library that could handle the relevant formats. SimpleITK is a Python binding to the ITK library written in C++. It includes many tools for image processing and is especially popular in the medical field. Other libraries were used for smaller tasks.
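As a brief illustration of the I/O workflow (the file name is only a placeholder), an MHD volume can be loaded with SimpleITK and converted to a NumPy array for further processing:
\begin{verbatim}
import SimpleITK as sitk

# Read a volumetric MHD image (file name is a placeholder)
image = sitk.ReadImage("sample_knee.mhd")

# Convert to a NumPy array; the slice axis (z) comes first
volume = sitk.GetArrayFromImage(image)
print(volume.shape)        # e.g. (41, 800, 800)
print(image.GetSpacing())  # voxel spacing from the meta information
\end{verbatim}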
\subsection{Data Set}
The data set is a collection of three-dimensional, T1-weighted and non-fat-saturated MR images showing the right knee. The number of available samples grew during the project; for most of the development time, it included 150 samples from 3 different sources. All images were provided in the MHD file format, which is very common in the medical field. It features lossless compression as well as technical meta information about each image. The resolution of the samples differed between sources.
\begin{table}[h!]
\centering
\begin{tabular}{ | l | l | l | r | r | l |}
\hline
Source & Prospective & View & Samples & Maps & Resolution \\
\Xhline{3\arrayrulewidth}
Class 1 & Yes & Coronal & 65 & 36 & 800 x 800 x 41 \\
\hline
Class 2 & Yes & Coronal & 80 & 40 & 512 x 512 x 24 \\
\hline
Class 3 & No & Sagittal & 5 & 0 & multiple \\
\hline
\end{tabular}
\caption{Details of the data set separated by source}
\end{table}
Class 3 data featured a sagittal perspective, while all other samples were acquired in the coronal plane. This mattered because the resolution of all images was roughly 20 times lower on the z-axis. The 80 class 2 images originated in a previous study on the condition of growth plates led by Jopp et al. A total of 65 class 1 images were taken as part of the DFG project; they showed the highest resolution because a different MRI machine was used.
\begin{figure}[H]
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/3.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/10.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
\includegraphics[width=\linewidth]{imgs/17.jpg}
\endminipage
\vspace{0.2cm}
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/24.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/31.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
\includegraphics[width=\linewidth]{imgs/38.jpg}
\endminipage
\caption{Six different slices of one 3D MRI}
\end{figure}
The images above step through the slices of one 3D MRI. The Femur already becomes visible in the first frame, whereas the Tibia starts to show in the second. The Fibula appears in the lower left corner of the fourth image and can be seen more clearly in the fifth.
A total of 76 segmentation maps were initially labeled by semi-automatic region growing followed by manual adjustment. The labeling process took 2 hours per 3D image, and each map showed Femur, Tibia, and Fibula separately. Only a square window around the point where Femur and Tibia meet was segmented, rather than the entire image.
\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/age_distr.png}
\caption{Age distribution in the data set}
\par
\end{figure}
The candidates chosen for the MRI recordings were all German males between the ages of 14 and 21. Their age was normally distributed, and a majority of the candidates were recorded twice, roughly a year apart.
\subsection{Preprocessing}
The parameters in a neural network commonly range from tens of thousands to hundreds of millions. This complexity allows the model to learn on its own which features of an image are relevant for a given task, but it only works in conjunction with high volumes of training data. Because of the small data set available for this study, several types of preprocessing were applied to the images. For the most part, these techniques remove irrelevant information and reduce variance between samples. Other preparation methods experimented with the difference between 2D and 3D data as well as the influence of separate segmentation channels on the output.
\subsubsection{Training, Validation, and Testing Sets}
The data was split into three subsets of which each was used for a different purpose (training, validation, and testing). The training set is commonly the largest of the three and contains the data that is applied to the actual learning process. It's the only portion of the data the network will draw direct conclusions from.
The validation data is used to regularly measure the performance of the model and check whether overfitting occurs or not. If the accuracy of the validation set drops below the results of the training data, the network is starting to memorize the data it knows rather than generalizing on the concept.
The third subset is referred to as the testing data. In contrast to the validation set, it's only used once in the very end, to give a final score. The idea is that by building a model based on the validation results a certain amount of information bleed occurs, where the network will implicitly learn from the validation data. In order to prevent biased results, the testing data is used as a last performance reference.
\begin{itemize}
\item Training Set: 74\% of the data
\item Validation Set: 13\% of the data
\item Test Set: 13\% of the data
\end{itemize}
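A minimal sketch of such a split, assuming the samples are available as a list or array and using the approximate proportions listed above:
\begin{verbatim}
import numpy as np

def split_data(samples, seed=42):
    # Shuffle once so the three subsets are drawn randomly
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(samples))

    n_train = int(0.74 * len(samples))
    n_val = int(0.13 * len(samples))

    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
\end{verbatim}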
\subsubsection{Cropping, Resizing, and Resampling}
The original framing left vast areas of the upper and lower leg visible in the picture. Since these were not relevant, they were cropped out. An existing algorithm was used to detect the center where Tibia and Femur meet, and only a square window around this point was kept.
Although there is no theoretical size limitation to using convolutional neural networks, it is desirable to reduce the spatial resolution and therefore decrease the number of calculations. 224 x 224 pixels for each slice still gave enough detail to identify the shape of Femur, Tibia, and Fibula, while also being a common resolution for CNNs.
Resizing the z-axis was problematic because the resolution was roughly 20 times lower. When scaling along this dimension, the segmentation maps of different slices started to blend and create merged maps of multiple layers. In order to get the images from two different main sources on the same scale, two different approaches were tried.
The 41 slices per image of the class 1 data were padded with empty pixels to 48 slices. Afterward, every second 2D slice was kept, matching the 24 slices of the data from Jopp et al. It was not possible to do it the other way around and upscale 24 slices to 48, because the interpolated segmentation maps might have falsified the data.
Throwing away perfectly good data is very uncommon in the machine learning field, especially if data sets are small. However, one slice shares a lot of information with its neighboring slices and can be seen as a sort of augmented version of them. By using the same spatial resolution for both sources, the balance between the amount of class 1 and class 2 data was kept. A total of 76 images with a resolution of 24 x 224 x 224 were now available for training.
The second approach converted the volumetric data to 2D while using all 41 slices from the class 1 data and all 24 slices from class 2. No samples were thrown away, but the data was no longer volumetric.
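A sketch of the first approach, assuming a class 1 volume is given as a NumPy array of shape (41, 224, 224); the z-axis is padded to 48 slices and every second slice is kept:
\begin{verbatim}
import numpy as np

def reduce_to_24_slices(volume):
    # Pad the 41 slices with empty slices up to 48
    missing = 48 - volume.shape[0]
    padded = np.pad(volume, ((0, missing), (0, 0), (0, 0)),
                    mode="constant", constant_values=0)
    # Keep every second slice, matching the 24 slices of the class 2 data
    return padded[::2]
\end{verbatim}
The same reduction has to be applied to the corresponding segmentation maps so that images and labels stay aligned.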
\subsubsection{Normalization}
The normalization of images refers to the process of transforming all samples onto the same scale of values. Two techniques are popular in the field of deep learning. The first one is called feature scaling, where every sample is rescaled to the range between 0 and 1.
%\begin{figure}[H]
\[x' = \frac {x - min(x)}{max(x) - min(x)}\]
%\end{figure}
The second technique calculates the standard score, where the mean is subtracted from the samples and then divided by the standard deviation.
%\begin{figure}[H]
\[x' = \frac {x - \mu}{\sigma}\]
%\end{figure}
In this case, the mean and standard deviation are not calculated for every image, but for all training samples. This centers the intensities around the average brightness of the training data. When normalizing validation and test data, one would also use the mean and standard deviation from the training set because those are the measures the architecture is expecting.
The standard score is the more desirable approach because its mean centering helps to prevent parameters from growing abnormally large or being driven to zero during training. It is also less sensitive to outliers, e.g. extremely bright spots, which frequently appeared in the images in the form of high intensities.
In contrast to the regular use of the standard score, every image was normalized on its own mean and standard deviation instead of calculating it on the entire training data. Class 2 featured roughly five times higher intensities than the class 1 data because they were recorded on different MRI machines. Figure 9 shows two spikes in mean intensities that had to be dealt with. By normalizing each 3D image on its own values, these two distributions were unified. Furthermore, this procedure will normalize future samples on the same scale the network expects no matter what the bit depth of an image is.
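A sketch of this per-image standard score, where each 3D volume is centered and scaled by its own statistics:
\begin{verbatim}
import numpy as np

def normalize(volume):
    # Standard score computed per image, not over the whole training set
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / volume.std()
\end{verbatim}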
\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/intensity_distr.png}
\caption{Intensity distribution in the data set}
\par
\end{figure}
This approach also has a downside. The network can no longer use the mean intensity and the standard deviation as a feature to learn from because every image will be identical according to these measures. However, these two features showed no correlation to the accuracy of a prediction, which is why it was decided to use this approach.
\subsubsection{N4 Bias Field Correction}
Due to the inhomogeneous magnetic field used in MRI machines, a bias field is present as a low-frequency non-uniform intensity in the acquired images. Several methods have been developed in the past of which N4ITK is the de facto standard in the field \cite{Tustison2010}. As the name suggests, it is included in the Insight Toolkit (ITK) for which SimpleITK delivered an available Python binding. Using the N4 Bias Field Correction with its default settings on a single 24 x 224 x 224 image took between 60 and 120 seconds.
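A minimal sketch of this correction step with SimpleITK; restricting the estimation to the foreground via an Otsu mask is a common choice and an assumption here, not something prescribed by the default settings:
\begin{verbatim}
import SimpleITK as sitk

def n4_correct(image):
    # N4 expects a floating point input
    image = sitk.Cast(image, sitk.sitkFloat32)
    # Estimate the bias field only inside the foreground
    mask = sitk.OtsuThreshold(image, 0, 1, 200)
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    return corrector.Execute(image, mask)
\end{verbatim}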
\begin{figure}[H]
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/orig.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}
\includegraphics[width=\linewidth]{imgs/corr.jpg}
\endminipage\hfill
\minipage{0.32\textwidth}%
\includegraphics[width=\linewidth]{imgs/diff2.jpg}
\endminipage
\caption{Raw image, corrected image and intensity differences}
\end{figure}
\subsubsection{Augmentation}
Image augmentation is a popular approach to virtually increase the size of the data set. The general idea is that a neural network will overfit more when learning one sample $n$ times than when learning $n$ alterations of this sample just once. It helps the model generalize to new images instead of memorizing patterns in the training data.
Lossless augmentation techniques are those that don't change the values and relative localities in a sample. In 2D these include vertical and horizontal flips, as well as diagonal flips if width and height are equal. Note that any multiple of a 90-degree rotation can also be created using a combination of these flips. Although the sample as a whole changes, its contained values remain untouched.
The term "lossless" only refers to the technical change and not necessarily to the semantic change. Vertically flipping a slice of this data set switches the absolute positions of Femur and Tibia. This might make it harder for the network to distinguish between the two.
Lossy augmentation methods include a variety of image transformations. Common choices include:
\begin{itemize}
\item Horizontal shifts
\item Vertical shifts
\item Rotations
\item Shear mapping
\item Brightness adjustments
\item Contrast adjustments
\item Gamma adjustments
\end{itemize}
For this study horizontal flips were implemented to give the impression that images from both knees were available. In addition to this, these images were shifted 24 pixels on the horizontal axis to add another type of augmentation.
Interestingly, these methods did not improve the accuracy of the model. The horizontal flips even hurt the performance and were therefore removed. The horizontal shifts were kept in the training set, so the network would perform better on unseen data that was not perfectly aligned.
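A sketch of the two augmentations, assuming a slice is given as a 2D NumPy array of shape (224, 224); the same transformation has to be applied to the corresponding segmentation map:
\begin{verbatim}
import numpy as np

def horizontal_flip(sl):
    # Mirror the slice left/right, simulating the opposite knee
    return sl[:, ::-1]

def horizontal_shift(sl, pixels=24):
    # Shift the slice along the horizontal axis, padding with zeros
    shifted = np.zeros_like(sl)
    shifted[:, pixels:] = sl[:, :-pixels]
    return shifted
\end{verbatim}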
\subsubsection{2D and 3D data}
Although convolutional neural networks are most commonly connected with 2D data such as photos or drawings, they can be applied in any dimensional space. In Natural Language Processing (NLP) 1D convolutions are often used on sentences and in finance they can be applied to time series forecasting. In the medical field where a variety of volumetric data exists, 3D convolutions have become a popular choice for building deep learning solutions.
In the context of neural networks, one has to differentiate between volumetric data with one channel and 2D data that consists of multiple color channels. Both are an example of 3D data, but the volumetric images feature three spatial dimensions, whereas photos feature only two. Convolutions commonly traverse the spatial dimensions, which means photos with multiple channels are usually not subject to 3D convolutions. Instead, various input channels are fed into a 2D CNN.
Although the data set consisted of volumetric MRI data, the z-axis showed 20 times fewer voxels than the other two dimensions. Since the influence of this anisotropy was unknown, both 3D and 2D architectures were investigated for this project. The 3D model didn't use any MaxPooling on the already small z-axis, and a kernel size of 3 x 3 x 3 was used, similar to the 3D version of U-Net \cite{Cicek2016}. The 2D data was created by using each slice as a single input image.
As described in 3.3.2 the 3D convolutional network used 24 x 224 x 224 voxel images from both the class 1 and 2 data. In contrast, the 2D network used all of the available slices from each source as separate images.
While in theory the spatial information of the z-axis gave the 3D network more contextual information for every slice, the 2D model performed better in the end. The latter was also less computationally expensive and allowed working with data where only single slices were available per sample.
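As an illustration of the difference in input handling (layer parameters are examples only), a 3D convolution treats the z-axis as a spatial dimension, while the 2D variant treats each slice as an independent single-channel image:
\begin{verbatim}
from keras.layers import Input, Conv2D, Conv3D

# Volumetric input: three spatial dimensions plus one channel
vol_in = Input(shape=(24, 224, 224, 1))
vol_conv = Conv3D(32, (3, 3, 3), padding="same", activation="elu")(vol_in)

# Slice-wise input: two spatial dimensions plus one channel
slice_in = Input(shape=(224, 224, 1))
slice_conv = Conv2D(32, (3, 3), padding="same", activation="elu")(slice_in)
\end{verbatim}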
\subsubsection{Separate Bone Maps}
The initial segmentation maps came with three separate values for Femur, Tibia, and Fibula. With this information, it was possible to train a network that would segment the three bones while still differentiating between them.
\begin{figure}[H]
\minipage{0.24\textwidth}
\includegraphics[width=\linewidth]{imgs/x1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
\includegraphics[width=\linewidth]{imgs/x2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
\includegraphics[width=\linewidth]{imgs/x3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
\includegraphics[width=\linewidth]{imgs/x4.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
\includegraphics[width=\linewidth]{imgs/y1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
\includegraphics[width=\linewidth]{imgs/y2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
\includegraphics[width=\linewidth]{imgs/y3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
\includegraphics[width=\linewidth]{imgs/y4.png}
\endminipage
\caption{Examples of slices with their ground truth segmentations}
\end{figure}
However, this led to problems with the Fibula, which was not recognized by any of the applied architectures. The channel was always predicted empty. Its appearance in the images accounted for 3.4\% of the segmented area, whereas the Tibia made up 41\% and the Femur accounted for the remaining 55.6\%. This led the network to learn that an empty prediction of this channel is valid in 96.6\% of the cases, which is very accurate on its own. Instead of spending capacity on improving a channel that only makes up a tiny fraction of the result, the network improved the performance of Femur and Tibia.
The only successful method was training a separate model for each of the three bones. The network started to predict empty segmentation maps in the beginning because the actual masks only made up a fraction of the frame. The only way to improve from here was to find the Fibula in the non-empty slices and segment its region. This was different from the tests before, where the unsatisfactory performance of the Fibula was balanced by making more accurate segmentations of the other two bones.
For another experiment, the three channels were merged to create a network that would segment any long bone in the image. Not only did this solve the problem with the Fibula, but the predictions of Femur and Tibia were just as accurate. The network no longer needed to learn that the global position in the image determines whether or not an object should be segmented. Instead, it could segment any long bone at any position in the frame.
While building the architecture in the following sections, this merged channel approach was used. It helped to speed up the design process because different settings only needed to be trained once instead of once for every bone.
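A sketch of the merging step, assuming the ground truth is stored as a label volume in which Femur, Tibia, and Fibula carry distinct non-zero values:
\begin{verbatim}
import numpy as np

def merge_bone_labels(label_volume):
    # Any labeled voxel becomes foreground; the bone identity is discarded
    return (label_volume > 0).astype(np.float32)
\end{verbatim}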
\subsection{Architecture}
The majority of classification CNNs use a recurring combination of convolutional and pooling layers \cite{Krizhevsky, He2015b, Iandola2016a, Ronneberger2015a}. By combining the local learning patterns of convolutions with the spatial reductions of pooling, the input is compressed into a dense representation of its most important features.
As mentioned in 2.2.5, a segmentation can be seen as a classification for every pixel. As such, early CNN segmentation models used small patches of the input image to predict just a single pixel through a classification pipeline. Afterward, a full segmentation map was assembled from these pixels. This process was very slow, and it also prevented the network from having a field of view larger than the inserted patch.
An improvement for segmentations came through architectures referred to as encoder-decoder models. The encoding step performs the same spatial compression used in classification networks. Afterward, in the decoding step, the spatial resolution is brought back to its original shape and further processed by additional convolutions. This yields huge speed improvements while also increasing the accuracy of the prediction. Encoder-decoder networks are the de facto standard in the current field of image segmentation \cite{Chen2017, Zhao2016, Lin2016, Chen2016, Nekrasov2016, Milletari2016, Cicek2016, Ronneberger2015a, Badrinarayanan2015}, and they were the starting point for building the architecture.
\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/unet.png}
\caption{Visualization of the U-Net architecture}
\par
\end{figure}
Figure 12 shows a particular implementation of an encoder-decoder model known as U-Net \cite{Ronneberger2015a}. It has been successfully used on a variety of projects in and outside the medical field. The contracting side on the left shows a similar architecture to classification CNNs, after which the process is mirrored on the expanding side.
The following subchapters will discuss key elements that go into the design process of such an architecture. It was also analyzed how technologies that were introduced after the U-Net paper could improve the performance on this data set.
\subsubsection{Channels, Growth Rate, and Depth}
The number of parameters in a neural network has a high correlation with its learning capacity. By adding more nodes that can be adjusted during training, the model can approximate a more complex function that transforms the input into the output. The downside is that a larger parameter count will also increase the possibility of overfitting the data. A convention in the field of CNNs is to gradually increase the number of channels, while the spatial resolution is reduced due to the use of MaxPooling \cite{Krizhevsky, He2015b, Iandola2016a, Ronneberger2015a}. U-Net also shows this behavior on the left side of its architecture.
A test was set up that compared the original U-Net against five smaller variants. They varied in the total number of parameters and how the number of channels changed from one layer to the next.
\begin{table}[H]
\centering
\begin{tabular}{| l | r | r | r | r | r | r | r | r | r | r |}
\hline
Name & C1 & C2 & C3 & C4 & C5 & C6 & C7 & C8 & C9 & Para. \\
\Xhline{3\arrayrulewidth}
Model A & 32 & 32 & 32 & 32 & 32 & 32 & 32 & 32 & 32 & 195k \\
\hline
Model B & 16 & 24 & 36 & 54 & 81 & 54 & 36 & 24 & 16 & 332k \\
\hline
Model C & 48 & 48 & 48 & 48 & 48 & 48 & 48 & 48 & 48 & 437k \\
\hline
Model D & 8 & 16 & 32 & 64 & 128 & 64 & 32 & 16 & 8 & 491k \\
\hline
Model E & 32 & 48 & 72 & 108 & 159 & 108 & 72 & 48 & 32 & 1.33m \\
\hline
U-Net & 64 & 128 & 256 & 512 & 1024 & 512 & 256 & 128 & 64 & 31.0m \\
\hline
\end{tabular}
\caption{Convolutional blocks with their respective output channels}
\end{table}
Training U-Net was very slow in two regards: each training step lasted six times longer compared to the smallest model, and it also took 25 times more training steps to reach the same results. Since it was also overfitting to a significant degree, it was not trained to the end. The second largest model, E, also overfitted on the training data and never achieved results comparable to the other four models.
Model D started with eight channels which were then increased by a factor of 2, whereas B started out larger but only increased the channels by a rate of 1.5. Despite being the smaller model, B showed a higher accuracy in the end. It was concluded that the initial number of output channels has a larger effect on the result than the growth rate.
This theory was supported by the results from A and C, both of which kept their initial number of channels throughout the network. These two showed the highest scores in comparison to the other models while maintaining their relatively small size. Since their results were identical, model A was kept because of its faster training speed. It was interesting to see that the smallest model performed best across the candidates.
The original U-Net featured a depth of 5 convolutional blocks until reaching the bottom of the "U"-shape. Experiments showed that a depth of 6 did not improve performance, while a depth of 4 decreased the accuracy. Therefore, Model A was left unchanged.
\subsubsection{Skip Connections}
U-Net uses skip connections to copy values from the left side to the corresponding layer on the right side. This improves performance because otherwise spatial information would get lost while contracting the input. An experiment showed that removing these paths indeed hurt the performance badly.
ResNet \cite{He2015b} is another popular classification architecture that was the first to use residual connections. These are another type of skip connection that bridge the beginning and end of a convolutional block. According to the original paper, they are also one of the main reasons CNNs can be expanded to depths of thousands of layers. Since their use in the industry has found great success in many models, they were also applied in this network. However, they did not have an impact on the performance or the training speed, which is why they were not used in the end.
\subsubsection{Dropout}
Dropout is a popular regularization technique that randomly zeros out a fraction of the nodes during training \cite{Srivastava2014}. It is understood that this helps the model to generalize better and reduce overfitting on a given data set. Well known image classification architectures like VGG16 \cite{Simonyan2014a}, SqueezeNet \cite{Iandola2016a} or AlexNet \cite{Krizhevsky} use Dropout near the end of the network. Similarly, U-Net uses Dropout at the end of the contracting path to prevent overfitting.
Since a single unit of dropout with a rate of 0.5 is common in other architectures, this was also chosen as a first option. Other tests included adding multiple dropout units between the convolutions on the contracting side, which led to slower training and lower scores. Adding dropout on the expanding side is rather unusual and also didn't perform well in the tests. In the end, the first candidate was chosen for future training.
\subsubsection{Activation Functions}
Activation functions sit between layers in a network to introduce a non-linearity. Without them, the stacked operations could be collapsed into a single linear transformation, removing the benefits of building a deep model. The rectified linear unit or ReLU \cite{Nair} is the default recommendation for an activation function in modern neural networks. It's defined as the maximum of 0 and the input value and can be described as a nonlinear function made up of two linear pieces. Because of this, it preserves many of the properties that make models easy to optimize and generalize well on new data \cite{Goodfellow2016}.
\[
relu(x) =
\begin{cases}
0 & \text{if } x < 0 \\
x & \text{if } x \geq 0
\end{cases}
\]
\[
elu(x) =
\begin{cases}
e^x - 1 & \text{if } x < 0 \\
x & \text{if } x \geq 0
\end{cases}
\]
Since the introduction of ReLU, other variants have built on its success, such as PReLU \cite{He2015a}, LReLU and ELU \cite{Clevert2015}. The latter, the exponential linear unit, shows the same linear behavior for positive values but applies an exponential function to negative inputs instead of erasing them. This has shown further improvements by keeping the mean of the values centered at 0 while only being slightly more computationally expensive. ELU showed superior results in comparison to ReLU and was chosen for all future tests.
\begin{figure}[H]
\centering
\par
\includegraphics[width=0.53\textwidth]{imgs/elu_relu5.png}
\caption{Visualization of ReLU and ELU}
\par
\end{figure}
\subsection{Training}
The training is where the actual learning takes place, and its process is inspired by how humans improve their performance on tasks. When faced with a difficult question, the first step would be a mere guess of what the answer might be. In a second phase, this guess is compared to the right answer and processed by the brain accordingly.
Similarly, the training of a neural network consists of two parts. In the first step, the input is fed forward through the network, and an answer prediction is made. In the very beginning, this is just a random guess because ANNs are initialized with random values.
\begin{figure}[H]
\centering
\par
\includegraphics[width=1.0\textwidth]{imgs/forward_backward_prop.png}
\caption{The iterative process of forward and backward propagation}
\par
\end{figure}
The prediction is then compared to the right answer, and the error is calculated. This metric can be fed back through the network to optimize the parameters in the model. These two steps are called forward and backward propagation.
\subsubsection{Metrics and Loss Functions}
Metrics are used in deep learning to measure the performance of a model. For example, the accuracy is often chosen to describe how well a neural network is doing on a classification task. An accuracy of 0.9 indicates that 9 out of 10 samples are classified correctly.
\[
\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}
\]
In the formula above T and F indicate whether a prediction was true or false. P and N stand for a positive or negative outcome.
The result of a loss function is a metric that will be minimized during the backward propagation process. It needs to be a differentiable function, which is why the accuracy cannot be used as a loss function. It is a binary metric that works with true or false values and not with probabilities.
In situations like these, a surrogate function is used that has a high correlation with the target metric. For classification problems, this is often the cross entropy. Because a segmentation can be seen as a classification for every output pixel, it was also chosen as a candidate for this study.
\[
\text{Cross Entropy} \ (p, q) = -\sum p(x) \log q(x)
\]
Another option was the $F_1$ score, which is a particular implementation of the F-Measure when $ \beta $ is 1. It is also known as the dice similarity coefficient (DSC).
\[
F_\beta \ \text{Score}= \frac{(1 + \beta^2) TP}{(1+\beta^2)TP+\beta^2FN+FP}
\]
Although the $F_1$ score is commonly applied as a binary measure and therefore not differentiable, a "soft" version can be used that accounts for continuous probabilities. To use it as a loss function where 0 describes the best possible outcome the $F_1$ loss was defined as $ 1 - F_1 \ \text{score} $.
\[
F_\beta \ \text{Loss} = 1 - F_\beta \ \text{Score}
\]
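A sketch of this "soft" $F_1$ loss as it could be written with the Keras backend; the small smoothing term, which avoids a division by zero for empty maps, is an addition not discussed above:
\begin{verbatim}
from keras import backend as K

def f1_loss(y_true, y_pred, smooth=1.0):
    # Soft counterparts of TP, FP and FN based on probabilities
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    tp = K.sum(y_true_f * y_pred_f)
    fp = K.sum((1.0 - y_true_f) * y_pred_f)
    fn = K.sum(y_true_f * (1.0 - y_pred_f))
    # Soft F1 score (dice similarity coefficient)
    f1 = (2.0 * tp + smooth) / (2.0 * tp + fn + fp + smooth)
    return 1.0 - f1
\end{verbatim}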
The value of $ \beta $ can be adjusted to shift the emphasis between precision and recall. Precision describes how much of the predicted area was true, whereas recall describes how much of the ground truth was recognized by the model. This can be useful for data sets with large imbalances between classes, such as the imbalance between Fibula, Femur, and Tibia in this study.
The $F_2$ score was tried for the three-channel segmentation, where the Fibula could not be recognized. Although this model showed better scores regarding the recall measure, the Fibula was still not segmented.
\[
\text{Precision} = \frac{TP}{TP+FP}
\]
\[
\text{Recall} = \frac{TP}{TP+FN}
\]
\[
\text{Total Error} = \frac{FP+FN}{TP+TN+FP+FN}
\]
Both loss functions were used in separate runs to compare the results of the $F_1$ score and the cross entropy. The $F_1$-trained model showed better results regarding the $F_1$ score, and the cross entropy model showed better results on the cross entropy loss. With regard to precision, recall, accuracy, and Intersection-over-Union (IoU), the $F_1$-trained network showed higher scores.
\[
\text{IoU}= \frac{TP}{TP+FP+FN}
\]
Another test run used both loss functions at the same time. This was done by feeding the output of the next-to-last layer into two separate segmentation outputs. Each of these last layers was trained with the $F_1$ loss and the cross entropy loss respectively, so all previous layers were trained by the sum of both loss functions. This architecture matched the best results of the previous runs in a single model but also increased the training time by 30\%.
Evaluating these two outputs on the other metrics favored the $F_1$ loss output. Combining both maps didn't improve the scores any further. Since the cross entropy was chosen as a surrogate function and not as a performance metric, all future tests used only the $F_1$ loss. This delivered faster training than the dual output method.
\subsubsection{Optimizer}
The previous chapter provided an overview of the loss function, which measures the performance of a prediction during the training process. This chapter discusses how the result can be used to execute the actual optimization step.
The derivative of a single-variable function defines the slope of this function at any given point. Knowing this, one can tell in which direction the function declines. Adjusting the independent variable along this descent minimizes the function's value, which in the case of a loss function is the error.
The gradient is the derivative for functions of multi-dimensional inputs, such as the loss functions used in deep learning. A process that minimizes its result is called gradient descent. While it is possible to determine its minimum analytically, it is intractable for artificial neural networks due to the high number of parameters \cite{Chollet2017}.
Instead, the stochastic gradient descent (SGD) is used which will take a random batch of the training data and iteratively adjust the parameters in small steps. SGD is the basis for all common ANNs, but over the years different variants were introduced to the field.
The Adam optimizer \cite{Kingma2014} is such a variant, which enhances SGD, among other things, through what is called "adaptive moment estimation," from which its name derives. It adaptively adjusts the learning rate, which defines how much the parameters are changed in one training step. By incorporating the previous and current shape of the slope, Adam can speed up training tremendously.
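In Keras, the optimizer is supplied when the model is compiled; a minimal sketch, assuming model is the Keras model of the architecture described in this chapter and f1_loss the loss function sketched earlier:
\begin{verbatim}
from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.001),
              loss=f1_loss,
              metrics=["accuracy"])
\end{verbatim}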
\subsubsection{Batch Size}
The number of random samples per training step is referred to as the batch size. In the past, it was believed that larger batches led to something called the generalization gap \cite{Keskar2016}, where the accuracy of a model would drop if it was trained on unusually large batches. Recent work \cite{Hoffer2017} suggests other reasons for this decrease in accuracy. While common batch sizes range from 32 to 256, Goyal et al. showed accurate results using 8192 images per batch when training a model on ImageNet \cite{Goyal2017}.
Using batches smaller than 32 samples can introduce a different kind of problem. If a batch contains too few data points to represent the mean of the data well, training may become slow and unstable.
Based on hardware limitations the largest possible batch sizes ranged from 24 to 48 samples depending on the current architecture. Whatever could be fit into memory was used for these experiments. Exceptions occurred when working with 3D convolutions in which case the batch size had to be limited to just 4 samples.
\subsubsection{Learning Rate Policy}
One full iteration over the training samples is referred to as an epoch, and the learning rate policy describes how the learning rate is changed from one epoch to the next. With the introduction of adaptive optimizers like Adam, there has been less emphasis on this topic because the learning rate is already modified during training. Even though this reduces the number of possible pitfalls, training time can often be saved with the right initial learning rate.
\newpage
Ten epochs were run at different learning rates to compare initial results and to examine the point at which the model wouldn't converge at all. 0.002 was the highest rate at which the model started training, but 0.001 resulted in the best score. Adding a decay that reduces the learning rate manually over time did not improve the results with the use of Adam.
\subsubsection{Early Stopping}
Neural networks will continuously minimize the loss on the training set. This result needs to be validated on data the network hasn't seen before. At a certain point during training, the performance on the validation set will start to decrease because the model is overfitting on the training data. The number of iterations to reach this point is dependent on many hyperparameters, as well as the random values the network has been initialized with. As such, it's difficult to calculate how long the training will need to reach its peak.
Early stopping is a simple technique that ends the training process as soon as the model stops improving on the validation data. To do this, a patience value is defined that determines how long the network should continue training after the score has stopped increasing. This is important because not every epoch will lead to a new best score on the validation data.
For test runs in this study, a patience of 9 was selected, which meant the training would stop after 10 epochs without improvement. Depending on the architecture and other hyperparameters this point was reached after 30 to 60 epochs. For the last training with the final set of hyperparameters, the patience was increased to 19, which didn't improve the accuracy. This was also a verification that the initial value of 9 was a good fit for this problem.
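Keras provides this behavior through a callback; a sketch using the patience value from above, monitoring the validation loss (the exact monitored quantity is an assumption):
\begin{verbatim}
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once the validation loss has not improved for several epochs
    EarlyStopping(monitor="val_loss", patience=9),
    # Keep the weights of the best epoch seen so far (file name hypothetical)
    ModelCheckpoint("best_model.h5", monitor="val_loss",
                    save_best_only=True),
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=32, callbacks=callbacks)
\end{verbatim}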
\subsection{Postprocessing}
After the training had finished, it was investigated how the results on the validation data could be improved further. Since the predicted segmentations contained a small number of values strictly between 0 and 1, a threshold was needed to create a fully binary map; 0.3 gave the best results. Furthermore, any 2D slice whose predicted area was smaller than 2\% of the total frame was predicted as empty instead. This helped against the false prediction of tiny fragments in the upper and lower layers. For the segmentation of the Fibula alone, this last step was not executed.
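A sketch of these two steps, assuming the prediction for one slice is a 224 x 224 array of probabilities:
\begin{verbatim}
import numpy as np

def postprocess_slice(pred, threshold=0.3, min_area=0.02):
    # Binarize the probability map
    binary = (pred > threshold).astype(np.uint8)
    # Discard slices whose predicted area is below 2% of the frame
    if binary.sum() < min_area * binary.size:
        return np.zeros_like(binary)
    return binary
\end{verbatim}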
\subsection{Summary}
In summary, the architecture features 4 "Down Blocks" on the contracting side that are mirrored through 4 "Up Blocks" on the expanding side. Each level is connected to the other side to share information at the current spatial resolution.
\begin{figure}[H]
\centering
\par
\includegraphics[width=0.7\textwidth]{imgs/model.png}
\caption{The proposed architecture for the segmentation}
\par
\end{figure}
A Down Block consists of 2 convolutional layers, each followed by an ELU activation function, after which a MaxPooling layer halves the width and height of the image. The Up Block looks similar but features a transposed convolutional layer in the beginning to upscale the resolution and no MaxPooling in the end. The Middle Block features a Dropout layer with a rate of 0.5, which is surrounded by convolutional layers as well. All convolutional layers have 32 channels. The output layer includes another convolution that reduces the channels to 1, representing the segmentation map.
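A sketch of these blocks in Keras; layer arguments not stated above, such as the padding, the kernel size of the transposed convolution, and the sigmoid output activation, are assumptions:
\begin{verbatim}
from keras.layers import (Input, Conv2D, Conv2DTranspose, MaxPooling2D,
                          Dropout, concatenate)
from keras.models import Model

def conv(x):
    # Convolution followed by an ELU activation, 32 channels throughout
    return Conv2D(32, (3, 3), padding="same", activation="elu")(x)

inputs = Input(shape=(224, 224, 1))

# Contracting path: 4 Down Blocks; outputs are kept for the skip connections
skips, x = [], inputs
for _ in range(4):
    x = conv(conv(x))
    skips.append(x)
    x = MaxPooling2D((2, 2))(x)

# Middle Block: convolutions surrounding a Dropout layer
x = conv(x)
x = Dropout(0.5)(x)
x = conv(x)

# Expanding path: 4 Up Blocks with transposed convolutions and skips
for skip in reversed(skips):
    x = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding="same")(x)
    x = concatenate([x, skip])
    x = conv(conv(x))

# Output layer reduces the channels to a single segmentation map
outputs = Conv2D(1, (1, 1), activation="sigmoid")(x)
model = Model(inputs=inputs, outputs=outputs)
\end{verbatim}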
\newpage
Other options that were tried but didn't improve performance included using convolutions with a stride of 2 instead of MaxPooling to downsample the images. Batch normalization was also applied but resulted in severe overfitting from which the network never recovered. The fact that it did not improve the performance came as a surprise because it improves most models. A possible reason could be the small batch size in combination with the high variance of intensities in the data: if the mean of each batch deviates significantly from the mean of the entire training data, batch normalization may hurt performance.
\newpage