# RSNA Intracranial Hemorrhage Detection
https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection

A video of our solution can be found here: https://www.youtube.com/watch?v=1zLBxwTAcAs

# Hardware used
* 2x NVIDIA RTX 2080Ti GPUs
* 128GB RAM
* 16-core AMD CPU
* Data stored on NVMe drives in a RAID configuration
* OS: Ubuntu 19.04

# Steps to reproduce results
1. Modify the data paths at the top of `data_prep.py`, `datasets.py` & `model.py`

2. Run `data_prep.py`. This takes around 12-15 hours for each set of images and will create:
    * `train_metadata.parquet.gzip`
    * `stage_1_test_metadata.parquet.gzip`
    * `stage_2_test_metadata.parquet.gzip`
    * `train_triplets.csv`
    * `stage_1_test_triplets.csv`
    * `stage_2_test_triplets.csv`
    * A folder called `png` for all 3 stages containing the preprocessed & cropped images

3. Run `batch_run.sh`. This will train (using 5-fold CV) and create submission files (as well as
out-of-fold predictions) using:
    * EfficientNet-B0 (224x224 model for fast experimentation)
    * EfficientNet-B5 (456x456, 3-slice model)
    * EfficientNet-B3 (300x300, 3-window model)
    * ~~DenseNet-169~~
    * ~~SE-ResNeXt101_32x4d~~

    This will create a timestamped folder in the `OUTPUT_DIR` containing:
    * A submission file
    * Out-of-fold (OOF) predictions (for later stacking models)
    * Model checkpoints for each fold
    * QC plots (e.g. ROC curves, train/validation loss curves)

4. To infer on different datasets:
    * In `config.yml` set `stage` to either `test1` or `test2`
    * For `checkpoint`, enter the name of the timestamped folder containing the model checkpoints
    * Set `epochs` to 1 (this will skip the training step)
    * Re-run the models using `batch_run.sh`. A new output directory will be created with the
    predictions
    * If a completely new dataset is being used, the file paths in `ICHDataset` found in
    `datasets.py` will need to be modified

# Model summary
## Primary model (3-slice model)
1. First the metadata is collected from the individual DICOM images. This allows the studies to be
grouped by `PatientID`, which is important for a stable cross-validation because the same patient
can appear in multiple studies.

2. Based on `StudyInstanceUID`, and sorting on `ImagePositionPatient`, it is possible to reconstruct
3D volumes for each study. However, since each study contained a variable number of axial slices
(between 20-60), it is difficult to create an architecture that implements 3D convolutions.
Instead, triplets of images were created from the 3D volumes to represent the RGB channels of an
image, i.e. the green channel being the target image and the red & blue channels being the adjacent
images. If an image was at the edge of the volume, then the green channel was repeated. This is
essentially a 3D volume but only using 3 axial slices. At this stage no windowing was applied
and the image is retained in Hounsfield units.

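The triplet construction can be sketched as follows (a minimal illustration assuming the study volume is a slice-sorted numpy array; the function name `make_triplet` is hypothetical, and the real pipeline records the triplets in `train_triplets.csv`):

```python
import numpy as np

def make_triplet(volume, i):
    """Build a 3-channel image from axial slice i of a (num_slices, H, W)
    volume sorted by ImagePositionPatient: green is the target slice,
    red/blue are the adjacent slices. At the volume edges the target
    slice is repeated in place of the missing neighbour."""
    green = volume[i]
    red = volume[i - 1] if i > 0 else green
    blue = volume[i + 1] if i + 1 < len(volume) else green
    return np.stack([red, green, blue], axis=-1)  # (H, W, 3), still in HU
```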
3. The images then had their objects labeled using `scipy.ndimage.label`, which looks for groups of
connected pixels. The group with the second largest number of pixels was assumed to be the head
(the largest group is the background). This removes most of the dead space and the headrest
of the CT scanner. A 10-pixel border was retained to keep some space for rotation augmentations.

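A sketch of this cropping step (illustrative only; the 0 HU threshold and the function name are assumptions, and the actual logic lives in `data_prep.py`):

```python
import numpy as np
from scipy import ndimage

def crop_to_head(img, border=10, threshold=0):
    """Crop a 2D image to its largest connected foreground component.
    `ndimage.label` treats pixels at/below the threshold as background
    (the largest group overall), so the biggest labelled group is
    assumed to be the head."""
    labels, n = ndimage.label(img > threshold)
    if n == 0:
        return img
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                      # ignore the background label
    ys, xs = np.where(labels == sizes.argmax())
    y0 = max(ys.min() - border, 0)
    y1 = min(ys.max() + border + 1, img.shape[0])
    x0 = max(xs.min() - border, 0)
    x1 = min(xs.max() + border + 1, img.shape[1])
    return img[y0:y1, x0:x1]
```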
4. The images were clipped between 0-255 Hounsfield units and saved as 8-bit PNG files to pass to
the PyTorch dataset object. The reasons for this were a) most of the interesting features are
between 0-255 HU, so we shouldn't be losing too much detail, and b) this makes it easier to try
different windows without recreating the images. I also found that processing DICOM images on the
fly was too slow to keep 2 GPUs busy when small images/large batch sizes were used (224x224,
batch size=256), which is why I went down the PNG route.

5. The images are then windowed once loaded into the dataset. A subdural window with
`window_width, window_length = 200, 80` was used on all 3 channels.

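Applying a CT window of this kind can be sketched as follows (a hypothetical helper; the dataset code applies the equivalent transform when loading):

```python
import numpy as np

def apply_window(hu, window_width=200, window_length=80):
    """Scale Hounsfield units into [0, 1] for a given CT window;
    `window_length` is the window centre (80 HU for the subdural window)."""
    low = window_length - window_width / 2
    high = window_length + window_width / 2
    return (np.clip(hu, low, high) - low) / (high - low)
```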
## Alternative model (3-window model)
An alternative model using different CT windowing for each channel is also used. Here the windows
are applied when the PNG images are made, according to the channels in `prepare_png` found in
`data_prep.py`. The images are cropped in the same way as above. The windows used were:
* Brain - `window_width, window_length = 80, 40`
* Subdural - `window_width, window_length = 200, 80`
* Bone - `window_width, window_length = 2000, 600`

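Stacking the three windows into one image can be sketched as follows (an illustration of the idea only; the real channel definitions live in `prepare_png` in `data_prep.py`):

```python
import numpy as np

def three_window_image(hu):
    """Use brain, subdural and bone windows as the three image channels."""
    def window(x, width, centre):
        low, high = centre - width / 2, centre + width / 2
        return (np.clip(x, low, high) - low) / (high - low)
    return np.stack([
        window(hu, 80, 40),      # brain
        window(hu, 200, 80),     # subdural
        window(hu, 2000, 600),   # bone
    ], axis=-1)
```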
## Model details

1. Image augmentations:
    1. `RandomHorizontalFlip(p=0.5)`
    2. `RandomRotation(degrees=15)`
    3. `RandomResizedCrop(scale=(0.85, 1.0), ratio=(0.8, 1.2))`

2. The models were then trained as follows:
    * 5 folds defined by grouping on `PatientID`. See the section below on the CV scheme.
    * All the data used (no down/up-sampling)
    * 10 epochs with early stopping (patience=3)
    * AdamW optimiser with default parameters
    * Learning rate decay using cosine annealing
    * Loss function: custom weighted multi-label log loss with `weights=[2, 1, 1, 1, 1, 1]`
    * Image size: 512x512. No pre-normalisation (i.e. ImageNet stats were not used)
    * Batch size: as large as possible depending on the architecture

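The weighted loss can be sketched as follows (a numpy illustration rather than the training implementation; the first weight corresponds to the "any hemorrhage" label):

```python
import numpy as np

def weighted_multilabel_log_loss(y_true, y_pred,
                                 weights=(2, 1, 1, 1, 1, 1), eps=1e-7):
    """Weighted mean of the per-label binary log loss over a batch.
    `y_true` and `y_pred` have shape (n_samples, 6)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    w = np.asarray(weights, dtype=float)
    return float((per_label * w).sum() / (w.sum() * len(y_true)))
```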
3. Postprocessing:
    * Test-time augmentation (TTA): identity, horizontal flip, rotation by -10 & +10 degrees
    * Take the mean of all 5 folds
    * A prediction smoothing script based on the relative positions of the axial slices
    (this script was used by all teammates)

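The idea behind the slice-position smoothing can be illustrated as follows (a hypothetical sketch; the team's actual script may weight the neighbours differently):

```python
import numpy as np

def smooth_study(preds, alpha=0.5):
    """Blend each interior slice's predictions with the mean of its two
    neighbours, with slices sorted by axial position within one study.
    `preds` has shape (n_slices, n_labels)."""
    out = preds.astype(float).copy()
    out[1:-1] = alpha * preds[1:-1] + (1 - alpha) * 0.5 * (preds[:-2] + preds[2:])
    return out
```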
# Cross validation scheme
The CV scheme was fixed using 5 pairs of CSV files agreed by the team, in the format `train_n.csv`
& `valid_n.csv`. The first version of these files was designed to prevent the same patient appearing
in both the train & validation sets. A second version was made, removing patients present in both
the train & stage 1 test sets, to prevent fitting to the overlapping patients on the stage 1 public LB.
Some of these models are trained with the V1 scheme and others with the V2 scheme (most of the team
used the latter).

These CSV files are included in a file called `team_folds.zip` and should be in the same folder
as the rest of the input data. The scheme is selected using the `cv_scheme` value in the config file.
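The folds themselves are fixed CSVs, but the patient-grouping constraint they encode is the same one that scikit-learn's `GroupKFold` enforces (illustrative only; the toy `patient_ids` below are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

patient_ids = np.array(["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p5", "p6", "p7"])
X = np.arange(len(patient_ids)).reshape(-1, 1)

for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, groups=patient_ids):
    # No patient ever appears in both the train and validation split of a fold
    assert not set(patient_ids[train_idx]) & set(patient_ids[valid_idx])
```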