\section{Results}

The results in this chapter are based on data the network was not trained on. Care was also taken to ensure that there was no overlap in images of people who were recorded multiple times. Five images were chosen randomly from each of the two sources at the very beginning. Since the Epi data featured 41 slices and the Jopp data featured 24 slices, 325 2D samples were available for evaluation. Each slice contained 50,176 pixel values, for a total of 16.3 million analyzed predictions.

The network architecture was developed using a single segmentation channel that merges the Femur, Tibia and Fibula maps. However, tests were also run to validate the performance for each bone on its own, as well as to combine the three separate predictions into a multi-channel segmentation. The same architecture and training procedure was used for this task.
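
The two variants differ essentially in the final layer: one sigmoid channel for the merged map versus three channels for the per-bone segmentation. The following is a minimal sketch of this idea, assuming a Keras implementation in which \texttt{features} is a placeholder for the last feature maps of the expanding path:

\begin{verbatim}
from keras.layers import Conv2D

# Merged: one channel containing Femur, Tibia and Fibula at once.
merged_output = Conv2D(1, 1, activation='sigmoid')(features)

# Combined: one sigmoid channel per bone.
multi_output = Conv2D(3, 1, activation='sigmoid')(features)
\end{verbatim}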

\subsection{Numeric Evaluation}

The proposed model achieves a DSC of 98.0\% and an IoU of 96.0\%. Precision and recall are perfectly balanced, suggesting that the predictions are neither too optimistic nor too pessimistic. The pixel-wise error is low at 1.2\%.

\begin{table}[H]
    \centering
    \begin{tabular}{| l | c | c | c | c | c |}
    \hline
           & DSC & IoU & Precision & Recall & Error \\
    \Xhline{3\arrayrulewidth}
    Merged & 0.980 & 0.960 & 0.980 & 0.980 & 0.012 \\
    \hline
    Femur & 0.981 & 0.963 & 0.979 & 0.984 & 0.006 \\
    \hline
    Tibia & 0.977 & 0.955 & 0.976 & 0.977 & 0.006 \\
    \hline
    Fibula & 0.953 & 0.911 & 0.954 & 0.952 & 0.001 \\
    \hline
    Combined & 0.979 & 0.958 & 0.977 & 0.981 & 0.004 \\
    \Xhline{3\arrayrulewidth}
    Femur \cite{Dodin2011} & 0.940 & - & - & - & - \\
    \hline
    Tibia \cite{Dodin2011} & 0.920 & - & - & - & - \\
    \hline
    Tibia \cite{Dam} & 0.975 & - & - & - & - \\
    \hline
    \end{tabular}
    \caption{Numeric evaluation of segmentations}
\end{table}

Results on the Femur and Tibia alone are comparable to the merged approach, whereas the Fibula segmentation shows lower scores. This could be because the Fibula is only visible in a minority of slices, making it a somewhat unbalanced task. Combining the three separate segmentations into a single model gives comparable results as well. The error is reduced by a factor of three, which is expected because the number of channels is tripled. Compared to previous studies, the proposed model shows slightly better results than the multi-atlas segmentation and significantly higher scores than the ray casting technique.
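
For reference, the scores above can be computed pixel-wise from binarized prediction and ground truth maps. The following is a minimal sketch in Python/NumPy; the binarization threshold of 0.5 and all variable names are assumptions, not the actual evaluation code.

\begin{verbatim}
import numpy as np

def segmentation_metrics(pred, truth, threshold=0.5):
    """DSC, IoU, precision, recall and pixel error for one
    binary segmentation. Empty masks need special handling."""
    p = pred > threshold
    t = truth > threshold
    tp = np.sum(p & t)           # true positive pixels
    fp = np.sum(p & ~t)          # false positives
    fn = np.sum(~p & t)          # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    error = (fp + fn) / p.size   # misclassified fraction
    return dsc, iou, precision, recall, error
\end{verbatim}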

\begin{figure}[H]
  \includegraphics[width=\linewidth]{imgs/train_val.png}
\caption{Visual representation of the performance during training}
\end{figure}

The graph above shows the performance of the merged model on the training and validation data over the course of training.

\begin{table}[H]
    \centering
    \begin{tabular}{| l | c | c | c | c | c |}
    \hline
    Data       & Epoch 1 & Epoch 5 & Epoch 10 & Epoch 20 & Epoch 40 \\
    \Xhline{3\arrayrulewidth}
    Training Set   & 0.753   & 0.949   & 0.966    & 0.975    & 0.981 \\
    \hline
    Validation Set & 0.847   & 0.956   & 0.963    & 0.976    & 0.979 \\
    \hline
    Test Set       &     -   &     -   &     -    &     -    & 0.980 \\
    \hline
    \end{tabular}
    \caption{Performance (DSC) of the merged model on different sets over time}
\end{table}

The 98\% mark is never reached on the validation data, only on the final evaluation of the test set. Towards the end, the performance on the training data pulls slightly ahead of the validation results.

\subsection{Visual Evaluation}

The visual evaluation focuses on the results of the combined network, since its performance is on par with the single-channel model while offering more information about the individual bones.

The network shows excellent performance on slices located near the middle of the 3D MRIs, often with DSC scores above 99\%.

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/a1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/b1.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c1.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/d1.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/a2.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/b2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/d2.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/a3.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/b3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/d3.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/a4.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/b4.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c4.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/d4.png}
\endminipage
\caption{Input, prediction, difference to ground truth and applied mask (from left to right)}
\end{figure}

Perfect DSC scores of 100\% are achieved on slices with empty ground truth segmentations that the network correctly predicts as empty.

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/a5.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/b5.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/a6.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/b6.png}
\endminipage
\caption{Correctly segmented upper and lower slices}
\end{figure}

The most imprecise results showed a DSC of 0\%, where ground truth and prediction did not overlap at all.

\begin{figure}[H]
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/a8.png}
\endminipage\hfill
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/b8.png}
\endminipage\hfill
\minipage{0.32\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c8.png}
\endminipage
\vspace{0.2cm}
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/a7.png}
\endminipage\hfill
\minipage{0.32\textwidth}
  \includegraphics[width=\linewidth]{imgs/b7.png}
\endminipage\hfill
\minipage{0.32\textwidth}%
  \includegraphics[width=\linewidth]{imgs/c7.png}
\endminipage
\caption{Input, ground truth and prediction of false segmentations}
\end{figure}

In these examples, the ground truth marks the Femur as visible in the upper image but not in the lower one. The prediction disagrees in both cases. These two examples could be seen as noise in the ground truth data, or at least as debatable.

\subsection{Model Exploration}

Neural networks are often considered black boxes because it is difficult to get an intuitive understanding of how their decisions are calculated. This is a crucial problem especially in the medical field, where a prediction may reflect certain health conditions of a patient.

Two factors help this study move away from this problem. Firstly, a segmentation is an image-to-image pipeline, which makes it less abstract what kind of transformations are happening. Visualizing input and output gives an intuitive understanding of how one may have emerged from the other.

Secondly, convolutional neural networks take away much of this black-box character \cite{Chollet2017} because every layer in the network can be visualized. The following images give an insight into what a sample looks like after different convolutional blocks.

\begin{figure}[H]
  \centering
  \includegraphics[width=1.0\textwidth]{imgs/archi_explo.png}
\caption{Intermediate sum of feature maps throughout the network}
\end{figure}

Each image shows the sum of all intermediate channels at a particular layer. The first image is the raw input, whereas the last image is the final segmentation map. One can clearly see the reduction in resolution towards the middle, which is then brought back up. The first convolutional block inverts the input and removes most of the bone structure except for the growth plates. The second to last output looks similar to the input, but the skin on the sides is almost entirely removed, and the dark growth plates have been filled. Looking closer at this image, one can also notice a crisp black line that separates the bone from the rest. A simple threshold at this point would already segment the bone fairly accurately.
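
Such intermediate outputs can be obtained from a probe model that exposes the activations of selected layers. The following is a minimal sketch using the Keras functional API, assuming the trained network is available as \texttt{model}; the layer names are placeholders.

\begin{verbatim}
from keras.models import Model
import matplotlib.pyplot as plt

layer_names = ['block1_conv', 'bottleneck', 'block8_conv']
outputs = [model.get_layer(n).output for n in layer_names]
probe = Model(inputs=model.input, outputs=outputs)

# sample has shape (224, 224, 1); add a batch dimension.
activations = probe.predict(sample[None, ...])
for act in activations:
    plt.figure()
    plt.imshow(act[0].sum(axis=-1), cmap='gray')  # sum channels
\end{verbatim}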

Even at this level, it is difficult to judge what exactly the network is detecting. For another visualization, the intermediate channels are not summed as before but analyzed separately.

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/channel1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/channel3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/channel2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/channel4.png}
\endminipage
\caption{Examples of intermediate channels throughout the network}
\end{figure}

These are a few notable examples from the model. The first filter detects high frequencies in the bone and skin tissue. Filter 2 finds vertical edges, while filter 3 shows horizontal ones. Finally, filter 4 appears to be a growth plate detector. It does not simply detect horizontal edges, but only those inside the bone. This seems reasonable because the network needs to learn that the dark growth plates should not be interpreted as edges.

\subsection{Noise Exploitation}

Ground truth data is often subject to noise, because mistakes happen during its creation or because it may be debatable what the actual truth is. To test how well the network performs on noisy data, another training environment was set up in which synthetic noise was added to the ground truth segmentations. Both the training and validation sets were modified using erosion and dilation with a kernel size of 7.
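
The synthetic noise can be generated in a few lines. The following is a minimal sketch using OpenCV; choosing randomly between erosion and dilation per sample is an assumption about the setup.

\begin{verbatim}
import cv2
import numpy as np

KERNEL = np.ones((7, 7), np.uint8)   # kernel size of 7

def add_label_noise(mask):
    """Randomly erode or dilate a binary ground truth mask."""
    if np.random.rand() < 0.5:
        return cv2.erode(mask, KERNEL)
    return cv2.dilate(mask, KERNEL)
\end{verbatim}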

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/orig_seg1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/noisy_seg1.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/orig_seg2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/noisy_seg2.png}
\endminipage
\caption{Examples of synthetic noise using erosion and dilation}
\end{figure}

The first and third images show the original segmentations, while the second and fourth were changed using erosion and dilation respectively. The noise on the training data accounted for a DSC error of 7.1\%. The predictions reach 95.2\% DSC on the unmodified test set, meaning that the noise hurts the performance of the model. More importantly, this DSC error of 4.8\% is lower than the error introduced by the noise: applying the same noise to the test set and treating it as a prediction yields a score of 92.8\% DSC, an error of 7.2\%. The predictions of the network therefore contain roughly 33\% less noise than the data it was trained on, demonstrating robustness against inaccurate labels.

\subsection{Transfer Application}

Neural networks are known to be unreliable on data that exceeds the range of variation in the training set; they cannot learn what they were not taught. On the other hand, convolutions are translation invariant, allowing them to recognize patterns anywhere in the frame \cite{Chollet2017}. Another merged model was trained with the same architecture but with additional image augmentation in the form of horizontal and vertical flips. It only reached a DSC score of 97.3\% this way, but was more robust to structural differences in the images.
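
For segmentation, such flips must be applied identically to the image and its mask so that both stay aligned. A minimal NumPy sketch of this augmentation step:

\begin{verbatim}
import numpy as np

def random_flip(image, mask):
    """Apply the same random flips to image and mask."""
    if np.random.rand() < 0.5:    # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if np.random.rand() < 0.5:    # vertical flip
        image, mask = image[::-1], mask[::-1]
    return image, mask
\end{verbatim}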

The proposed architecture can be described as ``fully convolutional'' because it does not include any dense layers. This allows the network to accept and process inputs of any resolution. An experiment was set up using uncropped images from the Epi data, resized to 448 x 448 pixels. This made them four times larger in area than the images the network was trained on.
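
In Keras terms, this property corresponds to leaving the spatial input dimensions undefined. The following toy sketch illustrates the idea and is not the actual architecture:

\begin{verbatim}
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

# None along the spatial axes: the same weights process
# 224 x 224 crops as well as 448 x 448 uncropped images.
inp = Input(shape=(None, None, 1))
x = Conv2D(16, 3, activation='relu', padding='same')(inp)
x = MaxPooling2D(2)(x)
x = Conv2D(32, 3, activation='relu', padding='same')(x)
x = UpSampling2D(2)(x)
out = Conv2D(1, 1, activation='sigmoid')(x)
model = Model(inp, out)
\end{verbatim}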

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_size_x1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_size_x2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_size_x3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_size_x4.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_size_y1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_size_y2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_size_y3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_size_y4.png}
\endminipage
\caption{Examples of uncropped and 448 x 448 pixel predictions}
\end{figure}

The predictions show good results, even following the shaft of the bone, which was not visible in the original cropped images. Even large intensity gaps caused by growth plates are recognized as Femur or Tibia. The recall performance is very accurate. Problems are visible where tissue that is not bone was recognized as such, as the fourth example shows.

In Section 3.2, a third data source was mentioned that featured five sagittal recordings of knees. These were not used for training because of their structural differences and because no ground truth segmentations were available. The following samples show what happens when the network is applied to images from a different perspective.

\begin{figure}[H]
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_pers_x1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_pers_x2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_pers_x3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_pers_x4.png}
\endminipage
\vspace{0.15cm}
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_pers_y1.png}
\endminipage\hfill
\minipage{0.24\textwidth}
  \includegraphics[width=\linewidth]{imgs/transfer_pers_y2.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_pers_y3.png}
\endminipage\hfill
\minipage{0.24\textwidth}%
  \includegraphics[width=\linewidth]{imgs/transfer_pers_y4.png}
\endminipage
\caption{Examples of sagittal class 3 data segmentations}
\end{figure}

These results look very accurate. In example 4, the network even recognizes the Patella, a bone it was never trained on. This suggests that, given enough image augmentation, the network learns general bone features rather than memorizing the specific bones in the training set.

\subsection{Proof of Concept: Age Assessment}

The initial motivation for the segmentation experiment was to reduce the amount of information in a knee MRI. The resulting images could then be used to make age assessments that focus on the bone and the growth plate. This section briefly covers such an age prediction pipeline.

The age of the candidates ranged from 14 to 21 years with a mean of 17.5 years. Predicting this mean for every person meant that an estimate was never off by more than 3.5 years. Since the data was normally distributed, this static prediction of the mean led to a mean difference of 1.2 years, which serves as the baseline.

Even using many of the techniques mentioned in previous chapters, it was not possible to train a stable model that could beat this baseline on the raw input data. After the segmentation maps had been applied to all 145 3D samples, many of the slices were completely empty because they did not contain any of the three bones. It was therefore decided to only use the middle 18 slices from each of the two sources. Even with these changes, the network would not converge.

Only after reusing the contracting side of the proposed model, with the parameters it had learned on the segmentation task, was it possible to train a network in a stable manner. This approach did not make the model converge every time, but when it did, its predictions were stable across multiple epochs. A global average pooling layer and a dense layer were added after the convolutional block in the middle. The output was a single continuous value representing the age of the individual.
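
A sketch of this regression head, assuming the trained segmentation model is available as \texttt{seg\_model} and its middle convolutional block is named \texttt{bottleneck}; both names are placeholders, not the thesis code.

\begin{verbatim}
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# Reuse the contracting path with its learned parameters.
encoder = seg_model.get_layer('bottleneck').output
x = GlobalAveragePooling2D()(encoder)
age = Dense(1, activation='linear')(x)   # age in years

reg_model = Model(seg_model.input, age)
reg_model.compile(optimizer='adam', loss='mae')
\end{verbatim}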

The results on the validation data showed a mean difference of 0.64 $\pm$ 0.48 years across all slices. Since every 3D sample now had 18 different predictions, a random forest regressor was built that takes a vector of 18 values and predicts a single age. This resulted in a mean difference of 0.56 $\pm$ 0.40 years on the validation data, and a final run on the test set showed 0.48 $\pm$ 0.32 years. The most accurate assessment of the 14 test samples was off by 0.07 years, and the worst estimate showed a difference of 1.37 years.
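
The aggregation step maps the 18 per-slice estimates of each 3D sample to a single age. A minimal sketch with scikit-learn; the variable names and hyperparameters are assumptions.

\begin{verbatim}
from sklearn.ensemble import RandomForestRegressor

# X: (n_samples, 18) per-slice age predictions per 3D sample
# y: (n_samples,) chronological ages
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
final_ages = forest.predict(X_val)
\end{verbatim}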

The results above show higher accuracy than the work of Stern et al. \cite{Stern2014} on carpal MRIs, which led to a mean difference of 0.85 $\pm$ 0.58 years. Their approach extracted 11 growth plates in the left hand of candidates and then mapped these features to an age estimate. It was not based on convolutional neural networks but used a random forest regressor for both stages. Their data set was uniformly distributed over the age range, resulting in a higher standard deviation for the age of candidates. However, their worst age estimate was off by 2.35 years, compared to the 1.37 years mentioned above. This suggests that the normal age distribution in this study is not the only cause of the better results.

\newpage