Diff of /experiments/tutorial.md [000000] .. [bb7f56]

Switch to unified view

a b/experiments/tutorial.md
1
# Tutorial
2
##### for Including a Dataset into the Framework
3
4
## Introduction
5
This tutorial aims at providing a muster routine for including a new dataset into the framework in order to 
6
use the included models and algorithms with it.\
7
The tutorial and toy dataset (under `toy_exp`) are in 2D, yet the switch to 3D is simply made by providing 3D data and proceeding 
8
analogically, as can be seen from the provided LIDC scripts (under `lidc_exp`).
9
10
Datasets in the framework are set up under `medicaldetectiontoolkit/experiments/<DATASET_NAME>` and
11
require three fundamental scripts:
12
1. A **preprocessing** script that performs one-time routines on your raw data bringing it into a suitable, easily usable 
13
format.
14
2. A **data-loading** script (required name `data_loader.py`) that efficiently assembles the preprocessed data into
15
network-processable batches.
16
3. A **configs** file (`configs.py`) which specifies all settings, from data loading to network architecture. 
17
This file is automatically complemented by `default_settings.py` which holds default and dataset-independent settings.
18
19
## Preprocessing
20
This script (`generate_toys.py` in case of the provided toy dataset, `preprocessing.py` in case of LIDC) is required
21
to bring your raw data into an easily usable format. We recommend, you put all one-time processes (like normalization, 
22
resampling, cropping, type conversions) into this script in order to avoid the need for repetitive actions during 
23
data loading.\
24
For the framework usage, we follow a simple workload separation scheme, where network computations
25
are performed on the GPU while data loading and augmentations are performed on the CPU. Hence, the framework requires 
26
numpy arrays (`.npy`) as input to the networks, therefore your preprocessed data (images and segmentations) should 
27
already be in that format. In terms of data dimensions, we follow the scheme: (y, x (,z)), meaning coronal, sagittal, 
28
and axial dimensions, respectively.
29
30
Class labels for the Regions of Interest (RoIs) need to be provided as lists per data sample.
31
If you have segmenation data, you may use the [batchgenerators](https://github.com/MIC-DKFZ/batchgenerators) transform 
32
ConvertSegToBoundingBoxCoordinates to generate bounding boxes from your segmentations. In that case, the order of the 
33
class labels in the list needs to correspond to the RoI labels in the segmentation.\
34
Example: An image (2D or 3D) has two RoIs, one of class 1, the
35
other of class 0. In your segmentation, every pixel is 0 (bg), except for the area marking class 1, which has value 1, 
36
and the area of class 0, which has value 2. Your list of class labels for this sample should be `[1, 0]`. I.e.,
37
the index of the RoI's class label in the sample's label list corresponds to its marking in the segmentation shifted 
38
by -1.\
39
If you do not have segmentations (only models Faster R-CNN and RetinaNet can be used), you can directly provide bounding
40
boxes. In that case, RoIs are simply identified by their indices in the lists: class label list `[cl_of_roi_a, cl_of_roi_b]` 
41
corresponds to bbox list `[coords_of_roi_a, coords_of_roi_b]`.
42
43
Please store all your light-weight information (patient id, class targets, (relative) paths or identifiers for data and seg) about the
44
preprocessed data set in a pandas dataframe, say `info_df.pkl`. 
45
46
## Data Loading
47
The goal of `data_loader.py` is to sample or iterate, load into CPU RAM, assemble, and eventually augment the preprocessed data.\
48
The framework requires the data loader to provide at least a function `get_train_generators`, which yields a dict
49
holding a train-data loader under key `"train"` and validation loader under `"val_sampling"` or `"val_patient"`, 
50
analogically for `get_test_generator` with `"test"`.\
51
We recommend you closely follow our structure as in the provided datasets, which includes a data loader suitable for 
52
sampling single patches or parts of the whole patient data with focus on class equilibrium (BatchGenerator,
53
used in training and optionally validation) and a PatientIterator which is intended for test and optionally valdiation and
54
 iterates through all patients one by one, not discarding 
55
any parts of the patient image. In detail, the structure is as follows.
56
57
Data loading is performed with the help of the batchgenerators package. Starting from farthest to closest to the 
58
preprocessed data, the data loader contains:
59
1. Method `get_train_generators` which is called by the execution script and in the end provides train and val data loaders.
60
 Same goes for `get_test_generator` for the test loader.
61
2. Method `load_dataset` which reads the `info_df.pkl` and provides a dictionary holding, per patient id, paths
62
 to images and segmentations, and light-weight info like class targets.
63
3. Method `create_data_gen_pipeline` which initiates the train data loader (instance of class BatchGenerator),
64
assembles the chosen data-augmentation procedures and passes the BatchGenerator into a MultiThreadedAugmenter (MTA). The MTA
65
is a wrapper that manages multi-threaded loading (and augmentation).
66
4. Class BatchGenerator. This data loader is used for sampling, e.g., according to the scheme described in 
67
`utils/dataloader_utils.get_class_balanced_patients`. It needs to implement a `__next__` method providing the batch; 
68
the batch is a dictionary with (at least) keys: `"data"`, `"pid"`, `"class_target"` (as well as `"seg"` if using segmentations).
69
    - `"data"` needs to hold your image (2D or 3D) as a numpy array with dimensions: (b, c, y, x(, z)), where b is the 
70
    batch dimension (b = batch size), c the channel dimension (if you have multi-modal data c > 1), y, x, z are 
71
    the spatial dimensions; z is omitted in case of 2D data. Since the batchgenerators package uses shape convention (x,y,z), 
72
    please make sure you switch augmentation settings explicitly affecting x and y (like rotation angle) accordingly. 
73
    - `"seg"` has the same format as `"data"`, except that its channel dimension has always size c = 1.
74
    - `"pid"` is a list of patient or sample identifiers, one per sample, i.e., shape (b,).
75
    - `"class_target"` which holds, as mentioned in preprocessing, class labels for the RoIs. It's a list of length b, holding
76
    itself lists of varying lengths n_rois(sample). 
77
    **Note**: the above description only applies if you use ConvertSegToBoundingBoxCoordinates. Class targets after batch 
78
    generation need to make room for a background class (network heads need to be able to predict class 0 = bg). Since, 
79
    in preprocessing, we started classes at id 0, we now need to shift them by +1. This is done automatically inside
80
    ConvertSegToBoundingBoxCoordinates. In case you do not use that transform, please shift the labels in your BatchGenerator.
81
5. Class PatientIterator. This data loader is intended for testing and validation. It needs to provide the same output as 
82
above BatchGenerator, however, initial batch size is always limited to one (one patient). Output batch size may vary 
83
 if patching is applied. Please refer to the LIDC PatientIterator 
84
to see how to include patching. Note that this Iterator is _not_ supposed to go through the MTA, transforms (mainly 
85
ConvertSegToBoundingBoxCoordinates) therefore need to be applied within this class directly.
86
87
88
## Configs
89
The current work flow is intended for running multiple experiments with the same dataset but different configs. This is
90
done by setting the desired values in `configs.py` in the data set's source directory, then creating an experiment
91
via the execution script (`exec.py`, modes "create_exp" or "train" or "train_test"), which copies a snapshot of configs, 
92
data loader, default configs, and selected model to the provided experiment directory.
93
94
`configs.py` introduces class `configs`, which, when instantiated, inherits settings in `default_configs.py` and adds 
95
model-specific settings to itself. Aside from setting all the right input/output paths, you can tune almost anything, from
96
network architecture to data-loading settings to train and test routine settings.\
97
Furthermore, throughout the whole framework, you have the option to include server-environment specific settings by passing
98
argument `--server_env` to the exec script. E.g., in the configs, we use this flag, to overwrite local paths by the
99
paths we use on our GPU cluster.