Diff of /README.md [000000] .. [d735e2]

Switch to unified view

a b/README.md
1
## Intracranial Hemorrhage Detection
2
3
This blog post is about the challenge that is hosted on kaggle on [RSNA Intracranial Hemorrhage Detection](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection). 
4
5
This post is divided into following parts
6
7
1. Overview
8
2. Basic EDA [Ipython Notebook](https://www.kaggle.com/suryaparsa/rsna-basic-eda-part-1)
9
3. Data Visualization & Preprocessing
10
4. Deep Learning Model
11
5. Demo
12
13
### 1. Overview
14
15
##### What is Intracranial Hemorrhage?
16
17
An intracranial hemorrhage is a type of bleeding that occurs inside the skull. Symptoms include sudden tingling, weakness, numbness, paralysis, severe headache, difficulty with swallowing or vision, loss of balance or coordination, difficulty understanding, speaking , reading, or writing, and a change in level of consciousness or alertness, marked by stupor, lethargy, sleepiness, or coma. Any type of bleeding inside the skull or brain is a medical emergency. It is important to get the person to a hospital emergency room immediately to determine the cause of the bleeding and begin medical treatment. It rquires highly trained specialists review medical images of the patient’s cranium to look for the presence, location and type of hemorrhage. The process is complicated and often time consuming. So as part of this we will be deep learning techniques to detect acute intracranial hemorrhage and its subtypes.
18
19
Hemorrhage Types
20
21
1. Epidural
22
2. Intraparenchymal    
23
3. Intraventricular
24
4. Subarachnoid 
25
5. Subdural
26
6. Any
27
28
##### What am i predicting?
29
30
In this competition our goal is to predict intracranial hemorrhage and its subtypes. Given an image the we need to predict probablity of each subtype. This indicates its a multilabel classification problem.
31
32
##### Evaluation Metric
33
34
Competition evaluation metric is **weighted log loss** but weights for each subtype is not disclosed as part of the competition but in the discussion forms some of the teams found it out that the any label has a weight of 2 compared to other subtypes, you can check more details [here](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/109526#latest-630190). But as part of this tutorial i'm going to use normal accuracy as evaluation metric and loss as **binary cross entropy loss** and checkpointing the models based on the loss.
35
36
37
### 2. Basic EDA 
38
39
Lets look at the [data](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/data) that is provided.
40
41
We have a train.csv containing file names and label indicating whether hemorrhage is present or not and train images folder which is set of [Dicom](https://www.dicomstandard.org/) files (Medical images are stored in dicom formats) and test images folder containing test dicom files.
42
43
```python
44
# load the csv file
45
train_df = pd.read_csv(input_folder + 'stage_1_train.csv')
46
train_df.head()
47
```
48
<img src='assets/df.png'/>
49
          
50
It consists of two columns ID and Label. ID has a format FILE_ID_SUB_TYPE for example ID_63eb1e259_epidural so ID_63eb1e259 is file id and epidural is subtype and Label indicating whether subtype hemorrhage is present or not.
51
52
Lets seperate file names and subtypes
53
54
```python
55
# extract subtype
56
train_df['sub_type'] = train_df['ID'].apply(lambda x: x.split('_')[-1])
57
# extract filename
58
train_df['file_name'] = train_df['ID'].apply(lambda x: '_'.join(x.split('_')[:2]) + '.dcm')
59
train_df.head()
60
```
61
<img src='assets/df2.png'/>
62
63
64
```python
65
train_df.shape
66
````
67
Output : (4045572, 4)
68
69
```python
70
print("Number of train images availabe:", len(os.listdir(path_train_img)))
71
```
72
Output : Number of train images availabe: 674258
73
74
The csv file has a shape of (4045572, 4). For every file(dicom file) present in the train folder has 6 entries in csv indicating possible 6 subtype hemorrhages.
75
76
Lets check the files available for each subtype
77
78
```python
79
plt.figure(figsize=(16, 6))
80
graph = sns.countplot(x="sub_type", hue="Label", data=(train_df))
81
graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
82
plt.show()
83
```
84
<img src='assets/counts.png'/>
85
86
87
Lets check the counts for each subtype
88
89
##### Epidural
90
91
```python
92
train_df[train_df['sub_type'] == 'epidural']['Label'].value_counts()
93
```
94
Output: 
95
96
0    671501
97
98
1      2761
99
100
Name: Label, dtype: int64
101
102
For epidural sub type we have 6,71,501 images labeled as 0 and 2,761 labelled as 1.
103
104
##### Intraparenchymal
105
106
```python
107
train_df[train_df['sub_type'] == 'intraparenchymal']['Label'].value_counts()
108
```
109
Output: <br/>
110
0    641698<br/>
111
1     32564<br/>
112
Name: Label, dtype: int64
113
114
For intraparenchymal sub type we have 6,41,698 images labeled as 0 and 32,564 labelled as 1.
115
116
117
##### Intraparenchymal
118
119
```python
120
train_df[train_df['sub_type'] == 'intraparenchymal']['Label'].value_counts()
121
```
122
Output: <br/>
123
0    650496<br/>
124
1     23766<br/>
125
Name: Label, dtype: int64
126
127
For intraparenchymal sub type we have 6,50,496 images labeled as 0 and 23,766 labelled as 1.
128
129
##### Subarachnoid
130
131
```python
132
train_df[train_df['sub_type'] == 'subarachnoid']['Label'].value_counts()
133
```
134
Output: <br/>
135
0    642140<br/>
136
1     32122<br/>
137
Name: Label, dtype: int64
138
139
For subarachnoid sub type we have 6,42,140 images labeled as 0 and 32,122 labelled as 1.
140
141
142
##### Subdural
143
144
```python
145
train_df[train_df['sub_type'] == 'subdural']['Label'].value_counts()
146
```
147
Output: <br/>
148
0    631766<br/>
149
1     42496<br/>
150
Name: Label, dtype: int64
151
152
For Subdural sub type we have 6,31,766 images labeled as 0 and 42,496 labelled as 1.
153
154
155
##### Any
156
157
```python
158
train_df[train_df['sub_type'] == 'any']['Label'].value_counts()
159
```
160
Output: <br/>
161
0    577159<br/>
162
1     97103<br/>
163
Name: Label, dtype: int64
164
165
For any sub type we have 5,77,159 images labeled as 0 and 97,103 labelled as 1.
166
167
### 3. Data Visualization & Preprocessing
168
169
Lets look at the dicom files in the dataset
170
171
```python
172
dicom = pydicom.read_file(path_train_img + 'ID_ffff922b9.dcm')
173
print(dicom)
174
```
175
<img src='assets/dicom.png'/>
176
177
178
Dicom data format files contain pixel data of image and other meta data like patient name, instance id, window width etc...
179
180
Original image
181
182
```python
183
plt.imshow(dicom.pixel_array, cmap=plt.cm.bone)
184
plt.show()
185
```
186
<img src='assets/original.png'/>
187
188
189
The orginal image seems to have difficult to understand, lets check meta deta features like Window Center, Window Width, Rescale Intercept, Rescale Slope 
190
191
<img src='assets/meta.png'/>
192
193
194
We can use these features to construct the new image.
195
196
```python
197
def get_dicom_field_value(key, dicom):
198
    """
199
    @param key: key is tuple
200
    @param dicom: dicom file
201
    """
202
    return dicom[key].value
203
204
window_center = int(get_dicom_field_value(('0028', '1050'), dicom))
205
window_width = int(get_dicom_field_value(('0028', '1051'), dicom))
206
window_intercept = int(get_dicom_field_value(('0028', '1052'), dicom))
207
window_slope = int(get_dicom_field_value(('0028', '1053'), dicom))
208
window_center, window_width, window_intercept, window_slope
209
210
def get_windowed_image(image, wc,ww, intercept, slope):
211
    img = (image*slope +intercept)
212
    img_min = wc - ww//2
213
    img_max = wc + ww//2
214
    img[img<img_min] = img_min
215
    img[img>img_max] = img_max
216
    return img 
217
    
218
windowed_image = get_windowed_image(dicom.pixel_array, window_center, window_width, \
219
                                    window_intercept, window_slope)
220
                                    
221
plt.imshow(windowed_image, cmap=plt.cm.bone)
222
plt.show()
223
```
224
<img src='assets/windowed.png'/>
225
226
227
228
The windowed image using meta data is much better than the orginal image this is because the dicom pixel array which contain pixel data contain raw data in Hounsfield units (HU). 
229
230
Scaling the image:
231
232
Rescale the image to range 0-255.
233
234
```python
235
def get_scaled_windowed_image(img):
236
    """
237
    Get scaled image
238
    1. Convert to float
239
    2. Rescale to 0-255
240
    3. Convert to unit8
241
    """
242
    img_2d = img.astype(float)
243
    img_2d_scaled = (np.maximum(img_2d,0) / img_2d.max()) * 255.0
244
    img_2d_scaled = np.uint8(img_2d_scaled)
245
    return img_2d_scaled
246
    
247
scaled_image = get_scaled_windowed_image(windowed_image)
248
plt.imshow(scaled_image, cmap=plt.cm.bone, vmin=0, vmax=255)
249
plt.show()
250
```
251
<img src='assets/scaled.png'/>
252
253
254
Hounsfield Units (HU) are the best source for constructing CT images. [Here](https://en.wikipedia.org/wiki/Hounsfield_scale) is detailed table showing the substance and HU range. 
255
256
A detailed explanation of all the possible windowing techniques can be found in this great kernel [(Gradient Sigmoid Windowing)](https://www.kaggle.com/reppic/gradient-sigmoid-windowing) 
257
258
```python
259
260
def correct_dcm(dcm):
261
    # Refer Jeremy Howard's Kernel https://www.kaggle.com/jhoward/from-prototyping-to-submission-fastai
262
    x = dcm.pixel_array + 1000
263
    px_mode = 4096
264
    x[x>=px_mode] = x[x>=px_mode] - px_mode
265
    dcm.PixelData = x.tobytes()
266
    dcm.RescaleIntercept = -1000
267
268
def window_image(dcm, window_center, window_width):
269
    
270
    if (dcm.BitsStored == 12) and (dcm.PixelRepresentation == 0) and (int(dcm.RescaleIntercept) > -100):
271
        correct_dcm(dcm)
272
    
273
    img = dcm.pixel_array * dcm.RescaleSlope + dcm.RescaleIntercept
274
    img_min = window_center - window_width // 2
275
    img_max = window_center + window_width // 2
276
    img = np.clip(img, img_min, img_max)
277
278
    return img
279
280
def bsb_window(dcm):
281
    brain_img = window_image(dcm, 40, 80)
282
    subdural_img = window_image(dcm, 80, 200)
283
    soft_img = window_image(dcm, 40, 380)
284
    
285
    brain_img = (brain_img - 0) / 80
286
    subdural_img = (subdural_img - (-20)) / 200
287
    soft_img = (soft_img - (-150)) / 380
288
    bsb_img = np.array([brain_img, subdural_img, soft_img]).transpose(1,2,0)
289
290
    return bsb_img
291
    
292
display_dicom_image('ID_0005d340e.dcm')
293
```
294
<img src='assets/dicom_all.png'/>
295
296
297
It looks like Brain + Subdural is a good start for our models it has three chaneels and cab be easily fed to any pretrained models. 
298
299
300
### 4. Deep Learning Model
301
302
The whole code for the training of the model can be found [here](/notebooks/Effnet-B0 Windowed Image.ipynb) 
303
304
We will using normal windowed images for training the model with augmentations like flip left right and random cropping.
305
306
Here are steps for training the model
307
308
1. Prepare train and validation data generators we will be splitting the data by stratifying the labels here id the link to [multilabel stratification](https://github.com/trent-b/iterative-stratification). We will make two splits and onlt work on the first split and check the results. 
309
2. Load pretrained Efficient Net B0 model.
310
3. For the first epoch use all the train images for training the model with the first head layers using as it as is by setting trainable as False but train all the later images and save the model.
311
4. Load the saved model and for the further epochs we train whole model except the last layer thus our model will learn most compliated features. 
312
5. Make predictions.
313
314
Sample code:
315
316
```python
317
# 1. ---------prepare data generators-------------#
318
# https://github.com/trent-b/iterative-stratification
319
# Mutlilabel stratification
320
splits = MultilabelStratifiedShuffleSplit(n_splits = 2, test_size = TEST_SIZE, random_state = SEED)
321
file_names = train_final_df.index
322
labels = train_final_df.values
323
# Lets take only the first split
324
split = next(splits.split(file_names, labels))
325
train_idx = split[0]
326
valid_idx = split[1]
327
submission_predictions = []
328
len(train_idx), len(valid_idx)
329
# train data generator
330
data_generator_train = TrainDataGenerator(train_final_df.iloc[train_idx], 
331
                                                train_final_df.iloc[train_idx], 
332
                                                TRAIN_BATCH_SIZE, 
333
                                                (WIDTH, HEIGHT),
334
                                                augment = True)
335
336
# validation data generator
337
data_generator_val = TrainDataGenerator(train_final_df.iloc[valid_idx], 
338
                                            train_final_df.iloc[valid_idx], 
339
                                            VALID_BATCH_SIZE, 
340
                                            (WIDTH, HEIGHT),
341
                                            augment = False)
342
# 2. ---------load efficient net B0 model-----------#
343
base_model =  efn.EfficientNetB0(weights = 'imagenet', include_top = False, \
344
                                 pooling = 'avg', input_shape = (HEIGHT, WIDTH, 3))
345
x = base_model.output
346
x = Dropout(0.125)(x)
347
output_layer = Dense(6, activation = 'sigmoid')(x)
348
model = Model(inputs=base_model.input, outputs=output_layer)
349
model.compile(optimizer = Adam(learning_rate = 0.0001), 
350
                  loss = 'binary_crossentropy',
351
                  metrics = ['acc', tf.keras.metrics.AUC()])
352
model.summary()
353
354
# 3. ---------for 1 st epoch train on whole dataset ------------#
355
for layer in model.layers[:-5]:
356
    layer.trainable = False
357
    
358
model.compile(optimizer = Adam(learning_rate = 0.0001), 
359
                  loss = 'binary_crossentropy',
360
                  metrics = ['acc'])
361
    
362
model.fit_generator(generator = data_generator_train,
363
                        validation_data = data_generator_val,
364
                        epochs = 1,
365
                        callbacks = callbacks_list,
366
                        verbose = 1)
367
368
# 4. ---------for rest of epochs train on sample data----------#
369
model.load_weights('model.h5')
370
model.compile(optimizer = Adam(learning_rate = 0.0004), 
371
                  loss = 'binary_crossentropy',
372
                  metrics = ['acc'])
373
model.fit_generator(generator = data_generator_train,
374
                        validation_data = data_generator_val,
375
                        steps_per_epoch=len(data_generator_train)/6,
376
                        epochs = 10,
377
                        callbacks = callbacks_list,
378
                        verbose = 1)
379
# 5. --------Make Predictions ------- --------------------------#
380
model.load_weights('model.h5')
381
382
def get_scores(data_gen, file_name='scores.pkl'):
383
    scores = model.evaluate_generator(data_gen, verbose=1)
384
    joblib.dump(scores, file_name)
385
    print(f"Loss: {scores[0]} and Accuracy: {scores[1]*100}")
386
```
387
388
Lets predict on train and validation generators.
389
390
```python
391
get_scores(data_gen=data_generator_train, file_name='train_scores.pkl')
392
```
393
<img src='assets/train.png'/>
394
395
```python
396
get_scores(data_gen=data_generator_val, file_name='val_scores.pkl')
397
```
398
<img src='assets/val.png'/>
399
400
Lets load test data frame, test data csv is also in the same format as train.csv
401
402
```python
403
# extract subtype
404
test_df['sub_type'] = test_df['ID'].apply(lambda x: x.split('_')[-1])
405
# extract filename
406
test_df['file_name'] = test_df['ID'].apply(lambda x: '_'.join(x.split('_')[:2]) + '.dcm')
407
408
test_df = pd.pivot_table(test_df.drop(columns='ID'), index="file_name", \
409
                                columns="sub_type", values="Label")
410
test_df.head()
411
412
test_df.shape
413
```
414
415
Output: (78545, 6)
416
417
So we have 78,545 test images and we need to predict 6 labels for each image. 
418
419
```python
420
preds = model.predict_generator(TestDataGenerator(test_df.index, None, VALID_BATCH_SIZE, \
421
                                                  (WIDTH, HEIGHT), path_test_img), 
422
                                verbose=1)
423
print(preds.shape)
424
```
425
Output: (78545, 6)
426
427
428
As per sample submission given by kaggle it is in a different format, the submission should be made with ID and Label column where ID is in the form of <b>dicomId_subType</b>(Ex:ID_0fbf6a978_subarachnoid) so we need format this to convert each prediction to 6 rows each indicating the id with sub type and its probability. The following code generates the required format for submission.
429
430
```python
431
def create_download_link(title = "Download CSV file", filename = "data.csv"):  
432
    """
433
    Helper function to generate download link to files in kaggle kernel 
434
    """
435
    html = '<a href={filename}>{title}</a>'
436
    html = html.format(title=title,filename=filename)
437
    return HTML(html)
438
439
def generate_submission_file(preds):
440
    from tqdm import tqdm
441
442
    cols = list(train_final_df.columns)
443
444
    # We have preditions for each of the image
445
    # We need to make 6 rows for each of file according to the subtype
446
    ids = []
447
    values = []
448
    for i, j in tqdm(zip(preds, test_df.index.to_list()), total=preds.shape[0]):
449
    #     print(i, j)
450
        # i=[any_prob, epidural_prob, intraparenchymal_prob, intraventricular_prob, subarachnoid_prob, subdural_prob]
451
        # j = filename ==> ID_xyz.dcm
452
        for k in range(i.shape[0]):
453
            ids.append([j.replace('.dcm', '_' + cols[k])])
454
            values.append(i[k])      
455
456
    df = pd.DataFrame(data=ids)
457
    df.head()
458
459
    sample_df = pd.read_csv(input_folder + 'stage_1_sample_submission.csv')
460
    sample_df.head()
461
462
    df['Label'] = values
463
    df.columns = sample_df.columns
464
    df.head()
465
466
    df.to_csv('submission.csv', index=False)
467
468
    return create_download_link(filename='submission.csv')
469
```
470
471
```python
472
df = pd.read_csv('submission.csv')
473
df.head()
474
```
475
476
<img src='assets/sample_sub.png'/>
477
478
All notebooks can be found [here](https://github.com/suryachintu/RSNA-Intracranial-Hemorrhage-Detection/tree/master/notebooks)
479
480
### 5. Demo
481
482
You can test the model by uploading the DICOM file [here](http://34.93.89.75:5325/)
483
484
485
### References
486
487
https://my.clevelandclinic.org/health/diseases/14480-intracranial-hemorrhage-cerebral-hemorrhage-and-hemorrhagic-stroke<br/>
488
https://github.com/MGH-LMIC/windows_optimization<br/>
489
https://arxiv.org/abs/1812.00572(Must read)
490
https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/111325#latest-650043
491
https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/109261#latest-651855
492
493
### Kaggle Kernels
494
495
https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai
496
https://www.kaggle.com/reppic/gradient-sigmoid-windowing
497
https://www.kaggle.com/jhoward/from-prototyping-to-submission-fastai
498
https://www.kaggle.com/suryaparsa/rsna-basic-eda-part-1
499
https://www.kaggle.com/suryaparsa/rsna-basic-eda-part-2