# RSNA Intracranial Hemorrhage Detection
https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection

A video of our solution can be found here: https://www.youtube.com/watch?v=1zLBxwTAcAs

# Hardware used
* 2x NVIDIA RTX 2080 Ti GPUs
* 128GB RAM
* 16-core AMD CPU
* Data stored on NVMe drives in a RAID configuration
* OS: Ubuntu 19.04

# Steps to reproduce results
1. Modify the data paths at the top of `data_prep.py`, `datasets.py` & `model.py`

2. Run `data_prep.py`. This takes around 12-15 hours for each set of images and will create:
    * `train_metadata.parquet.gzip`
    * `stage_1_test_metadata.parquet.gzip`
    * `stage_2_test_metadata.parquet.gzip`
    * `train_triplets.csv`
    * `stage_1_test_triplets.csv`
    * `stage_2_test_triplets.csv`
    * A folder called `png` for all 3 stages containing the preprocessed & cropped images

3. Run `batch_run.sh`. This will train (using 5-fold CV) and create submission files (as well as
out-of-fold predictions) using:
    * EfficientNet-B0 (224x224 model for fast experimentation)
    * EfficientNet-B5 (456x456, 3-slice model)
    * EfficientNet-B3 (300x300, 3-window model)
    * ~~DenseNet-169~~
    * ~~SE-ResNeXt101_32x4d~~

    This will create a timestamped folder in the `OUTPUT_DIR` containing:
    * A submission file
    * Out-of-fold (OOF) predictions (for later stacking models)
    * Model checkpoints for each fold
    * QC plots (e.g. ROC curves, train/validation loss curves)

4. To infer on different datasets:
    * In `config.yml`, set `stage` to either `test1` or `test2`
    * For `checkpoint`, enter the name of the timestamped folder containing the model checkpoints
    * Set `epochs` to 1 (this will skip the training part)
    * Re-run the models using `batch_run.sh`. A new output directory will be created with the
    predictions
    * If a completely new dataset is being used, the file paths in `ICHDataset` (found in
    `datasets.py`) will need to be modified

# Model summary

## Primary model (3-slice model)
1. First, the metadata is collected from the individual DICOM images. This allows the studies to be
grouped by `PatientID`, which is important for a stable cross-validation because the same patient can
appear in multiple studies.
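
A minimal sketch of this kind of metadata collection, assuming `pydicom` and a flat folder of
`.dcm` files (the exact fields collected by `data_prep.py` may differ):

```python
from pathlib import Path

import pandas as pd
import pydicom

def collect_metadata(dicom_dir: str) -> pd.DataFrame:
    """Read the header of every DICOM file, one row per image."""
    rows = []
    for path in Path(dicom_dir).glob("*.dcm"):
        ds = pydicom.dcmread(path, stop_before_pixels=True)  # headers only
        rows.append({
            "SOPInstanceUID": ds.SOPInstanceUID,
            "PatientID": ds.PatientID,
            "StudyInstanceUID": ds.StudyInstanceUID,
            # z-coordinate, used below to sort the slices of a study
            "ImagePositionPatient_z": float(ds.ImagePositionPatient[2]),
        })
    return pd.DataFrame(rows)
```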

2. Based on `StudyInstanceUID`, and by sorting on `ImagePositionPatient`, it is possible to reconstruct
3D volumes for each study. However, since each study contained a variable number of axial slices
(between 20 and 60), it is difficult to create an architecture that implements 3D convolutions.
Instead, triplets of images were created from the 3D volumes to represent the RGB channels of an
image, i.e. the green channel being the target image and the red & blue channels being the adjacent
images. If an image was at the edge of the volume, then the green channel was repeated. This is
essentially a 3D volume but only using 3 axial slices. At this stage no windowing was applied
and the images were retained in Hounsfield units.
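
A hedged illustration of the triplet logic (not the exact code), given the slice IDs of one study
already sorted by z-position:

```python
def make_triplets(sorted_ids):
    """Map each slice to (previous, current, next), repeating the current
    slice at the edges of the volume as described above."""
    triplets = []
    n = len(sorted_ids)
    for i, cur in enumerate(sorted_ids):
        red = sorted_ids[i - 1] if i > 0 else cur       # slice below
        blue = sorted_ids[i + 1] if i < n - 1 else cur  # slice above
        triplets.append((red, cur, blue))               # R, G, B channels
    return triplets

# make_triplets(["s0", "s1", "s2"]) ->
# [("s0", "s0", "s1"), ("s0", "s1", "s2"), ("s1", "s2", "s2")]
```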

3. The objects in each image were then labelled using `scipy.ndimage.label`, which finds groups of
connected pixels. The group with the second-largest number of pixels was assumed to be the head
(the largest group being the background). This removes most of the dead space and the headrest
of the CT scanner. A 10-pixel border was retained to keep some space for rotation augmentations.
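
A minimal sketch of this cropping step; the threshold used to build the binary mask is an
assumption, not necessarily what `data_prep.py` does:

```python
import numpy as np
from scipy import ndimage

def crop_to_head(img_hu: np.ndarray, border: int = 10) -> np.ndarray:
    """Crop to the connected component assumed to be the head."""
    mask = img_hu > -100             # soft tissue & bone vs. air (assumed cut-off)
    labels, _ = ndimage.label(mask)  # 0 = background, 1..n = components
    counts = np.bincount(labels.ravel())
    head = counts[1:].argmax() + 1   # largest non-background component
    ys, xs = np.where(labels == head)
    y0, y1 = max(ys.min() - border, 0), min(ys.max() + border + 1, img_hu.shape[0])
    x0, x1 = max(xs.min() - border, 0), min(xs.max() + border + 1, img_hu.shape[1])
    return img_hu[y0:y1, x0:x1]
```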

4. The images were clipped to the 0-255 Hounsfield unit range and saved as 8-bit PNG files to pass to
the PyTorch dataset object. The reasons for this were (a) most of the interesting features lie in the
0-255 HU range, so we shouldn't be losing too much detail, and (b) it makes it easier to try
different windows without recreating the images. I also found that processing DICOM images on the
fly was too slow to keep 2 GPUs busy when small images/large batch sizes were used (224x224,
batch size=256), which is why I went down the PNG route.
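
A sketch of the clip-and-save step (assuming `numpy` arrays in HU and `Pillow` for the PNG output):

```python
import numpy as np
from PIL import Image

def save_as_png(img_hu: np.ndarray, out_path: str) -> None:
    """Clip to 0-255 HU and store as 8-bit PNG, so one HU step maps to
    one grey level and windows can be re-applied later."""
    img = np.clip(img_hu, 0, 255).astype(np.uint8)
    Image.fromarray(img).save(out_path)
```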

5. The images are then windowed once loaded into the dataset. A subdural window with
`window_width, window_length = 200, 80` was applied to all 3 channels.
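
For reference, standard CT windowing, interpreting `window_length` as the window level/centre (a
sketch; the implementation in `datasets.py` may differ):

```python
import numpy as np

def window(img_hu: np.ndarray, width: float = 200, level: float = 80) -> np.ndarray:
    """Clip to [level - width/2, level + width/2] and rescale to [0, 1]."""
    lo, hi = level - width / 2, level + width / 2
    return (np.clip(img_hu, lo, hi) - lo) / (hi - lo)
```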

## Alternative model (3-window model)
An alternative model using different CT windowing for each channel is also used. Here the windows
are applied when the PNG images are made, according to the channels in `prepare_png` found in
`data_prep.py`. The images are cropped in the same way as above. The windows used were:
* Brain - `window_width, window_length = 80, 40`
* Subdural - `window_width, window_length = 200, 80`
* Bone - `window_width, window_length = 2000, 600`
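
Using a windowing helper like the `window()` sketch above, the three channels could be stacked
roughly as follows (hedged; the real code lives in `prepare_png` in `data_prep.py`):

```python
import numpy as np

def three_window(img_hu: np.ndarray) -> np.ndarray:
    """Stack the three windows listed above into the RGB channels."""
    return np.dstack([
        window(img_hu, width=80, level=40),     # brain
        window(img_hu, width=200, level=80),    # subdural
        window(img_hu, width=2000, level=600),  # bone
    ])
```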

## Model details

1. Image augmentations (composed as sketched below):
    1. `RandomHorizontalFlip(p=0.5)`
    2. `RandomRotation(degrees=15)`
    3. `RandomResizedCrop(scale=(0.85, 1.0), ratio=(0.8, 1.2))`

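
    A sketch of these augmentations composed with `torchvision` (the crop size of 512 is an
    assumption based on the image size listed below):

    ```python
    from torchvision import transforms

    train_tfms = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),
        transforms.RandomResizedCrop(512, scale=(0.85, 1.0), ratio=(0.8, 1.2)),
    ])
    ```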
2. The models were then trained as follows:
    * 5 folds defined by grouping on `PatientID`. See the section below on the CV scheme.
    * All the data used (no down/up sampling)
    * 10 epochs with early stopping (patience=3)
    * AdamW optimiser with default parameters
    * Learning rate decay using cosine annealing
    * Loss function: custom weighted multi-label log loss with `weights=[2, 1, 1, 1, 1, 1]`
    (see the sketch after this list)
    * Image size: 512x512. No pre-normalisation (i.e. ImageNet stats were not used)
    * Batch size: as large as possible depending on the architecture
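
    A minimal sketch of such a weighted multi-label log loss, assuming the first weight belongs
    to the `any` label (a hypothetical implementation, not the team's exact code):

    ```python
    import torch
    import torch.nn.functional as F

    def weighted_log_loss(logits, targets, weights=(2., 1., 1., 1., 1., 1.)):
        """Per-label binary cross-entropy, combined as a weighted mean."""
        w = torch.tensor(weights, device=logits.device)
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        return (bce * w).sum(dim=1).div(w.sum()).mean()  # (batch, 6) -> scalar
    ```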

3. Postprocessing:
    * Test time augmentation (TTA): identity, horizontal flip, rotate -10 degrees & +10 degrees
    * Take the mean of the predictions from all 5 folds
    * A prediction smoothing script based on the relative positions of the axial slices
    (this script was used by all teammates)
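
    A rough sketch of the TTA step (hypothetical; `TF.rotate` accepts image tensors in recent
    `torchvision` versions):

    ```python
    import torch
    import torchvision.transforms.functional as TF

    @torch.no_grad()
    def predict_tta(model, batch):
        """Average sigmoid outputs over the four TTA views listed above."""
        views = [batch,
                 torch.flip(batch, dims=[-1]),  # horizontal flip
                 TF.rotate(batch, -10),
                 TF.rotate(batch, 10)]
        return torch.stack([torch.sigmoid(model(v)) for v in views]).mean(dim=0)
    ```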

# Cross validation scheme
The CV scheme was fixed using 5 pairs of CSV files agreed by the team, in the format `train_n.csv`
& `valid_n.csv`. The first version of these files was designed to prevent the same patient appearing
in the train & validation sets. A second version was made by removing patients present in both the
train & stage 1 test sets, to prevent fitting to the overlapping patients on the stage 1 public LB.
Some of the models were trained with the V1 scheme and others with the V2 scheme (most of the team
used the latter).

These CSV files are included in a file called `team_folds.zip` and should be in the same folder
as the rest of the input data. The scheme is selected using the `cv_scheme` value in the config file.
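
For reference, fold files with this property (no patient shared between train & validation) could
be generated along these lines (a sketch, not the team's actual script):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_parquet("train_metadata.parquet.gzip")
for n, (tr, va) in enumerate(GroupKFold(n_splits=5)
                             .split(meta, groups=meta["PatientID"])):
    meta.iloc[tr].to_csv(f"train_{n}.csv", index=False)
    meta.iloc[va].to_csv(f"valid_{n}.csv", index=False)
```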