# How to train a model using this codebase?

The models in this work were trained on a single node with `torch.cuda.device_count()` GPUs. In our case, `torch.cuda.device_count() == 4` on a single Microsoft Azure VM (node). Each GPU had 16 GiB of memory, and the machine had 24 vCPUs and 448 GiB of RAM.

Running a training experiment primarily uses three files from this codebase: [config.py](./../config.py), [segmentation/trainddp.py](./../segmentation/trainddp.py) and [segmentation/initialize_train.py](./../segmentation/initialize_train.py). The first step is to set the variable `LYMPHOMA_SEGMENTATION_FOLDER` in [config.py](./../config.py) to the correct path. Then put all the training (and test, if applicable) data inside the `LYMPHOMA_SEGMENTATION_FOLDER/data` folder, as described in [dataset_format.md](./dataset_format.md).

```
import os

# Path to the directory containing the `data` and `results` folders
# (`results` will be created by the pipeline).
LYMPHOMA_SEGMENTATION_FOLDER = '/path/to/lymphoma.segmentation/folder/for/data/and/results'

DATA_FOLDER = os.path.join(LYMPHOMA_SEGMENTATION_FOLDER, 'data')
RESULTS_FOLDER = os.path.join(LYMPHOMA_SEGMENTATION_FOLDER, 'results')
os.makedirs(RESULTS_FOLDER, exist_ok=True)
WORKING_FOLDER = os.path.dirname(os.path.abspath(__file__))
```

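
Before launching training, it can help to confirm that the data folder is laid out as expected. The following is a minimal sketch and not part of this codebase: the helper name `find_missing_subfolders` is hypothetical, and the subfolder names `imagesTr` and `labelsTr` are the training folders referred to in this document; adjust them if your layout differs.

```python
import os


def find_missing_subfolders(data_folder, required=("imagesTr", "labelsTr")):
    """Return the required subfolders that are absent from `data_folder`.

    Hypothetical helper for illustration; it is not part of this codebase.
    """
    return [
        name
        for name in required
        if not os.path.isdir(os.path.join(data_folder, name))
    ]


if __name__ == "__main__":
    # Placeholder path: replace it with the value used in config.py.
    lymphoma_segmentation_folder = "/path/to/lymphoma.segmentation/folder/for/data/and/results"
    data_folder = os.path.join(lymphoma_segmentation_folder, "data")
    missing = find_missing_subfolders(data_folder)
    if missing:
        print(f"Missing under {data_folder}: {', '.join(missing)}")
    else:
        print("Training data folders found.")
```

An empty list means both training folders exist; the check does not inspect the files inside them.
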

Once the dataset is organized as described in [dataset_format.md](./dataset_format.md) and [config.py](./../config.py) is set up correctly, you are ready to launch the training script.

## Step 1: Activate the required conda environment (`lymphoma_seg`) and navigate to the `segmentation` folder
First, activate the conda environment `lymphoma_seg` (created as described in [conda_env.md](./conda_env.md)) and navigate to the `segmentation` folder:

```
conda activate lymphoma_seg
cd segmentation
```

## Step 2: Run the training script
Then run the following command in your terminal:

```
torchrun --standalone --nproc_per_node=1 trainddp.py \
    --fold=0 --network-name='unet' --epochs=500 --input-patch-size=192 \
    --train-bs=1 --num_workers=2 --cache-rate=0.5 --lr=2e-4 --wd=1e-5 \
    --val-interval=2 --sw-bs=2
```

Here, we use PyTorch's `torchrun` to launch distributed training, which can span multiple GPUs. The `--standalone` flag indicates that only one node is used. The remaining arguments are:

- `--nproc_per_node` defines the number of processes per node; in this case, it is the number of GPUs you want to use for training. We used `--nproc_per_node=4`, but you can set it to the number of GPUs available on your machine.

- `trainddp.py` is the training script, which uses `torch.nn.parallel.DistributedDataParallel`.

- `--fold` defines the fold for which you want to run training. When the above command is run for the first time, two files, `train_filepaths.csv` and `test_filepaths.csv`, are created within the folder `WORKING_FOLDER/data_split`. The former contains the filepaths (CT, PT, mask) for the training images (from the `imagesTr` and `labelsTr` folders, as described in `dataset_format.md`), and the latter contains the filepaths for the test images (from `imagesTs` and `labelsTs`). `train_filepaths.csv` contains a column named `FoldID` with values in `{0, 1, 2, 3, 4}` that defines which fold each row belongs to. With `--fold=0`, for example, the code uses all rows with `FoldID == 0` for validation and all rows with `FoldID != 0` for training. Defaults to 0.

- `--network-name` defines the name of the network. In this work, we trained UNet, SegResNet, DynUNet and SwinUNETR (adapted from MONAI [LINK]). Hence, `--network-name` should be set to one of `unet`, `segresnet`, `dynunet`, or `swinunetr`. Defaults to `unet`.

- `--epochs` is the total number of training epochs. Defaults to 500.

- `--input-patch-size` defines the size of the cubic patch that is cropped from the input images during training. The code uses `monai.transforms.RandCropByPosNegLabeld` (inside `segmentation/initialize_train.py`) to create these cropped patches. We used an `--input-patch-size` of 224 for UNet, 192 for SegResNet, 160 for DynUNet and 128 for SwinUNETR. Defaults to 192.

- `--train-bs` is the training batch size. We used `--train-bs=1` for all experiments in this work because, with the `--input-patch-size` values listed above, we could not accommodate larger batch sizes for SegResNet, DynUNet, and SwinUNETR. Defaults to 1.

- `--num_workers` defines the `num_workers` argument of the training and validation DataLoaders. Defaults to 2.

- `--cache-rate` defines the fraction of the data to cache in `monai.data.CacheDataset`. Unlike `torch.utils.data.Dataset`, this type of dataset can load and cache the results of deterministic transforms during training. A cache rate of 1 caches all the data in memory, while a cache rate of 0 caches nothing. A higher cache rate leads to faster training at the cost of higher memory consumption. Defaults to 0.1.

- `--lr` defines the initial learning rate. A cosine annealing scheduler decays the learning rate from this initial value to 0 over `epochs` epochs. Defaults to 2e-4.

- `--wd` defines the weight decay for the AdamW optimizer used in this work. Defaults to 1e-5.

- `--val-interval` defines the interval at which validation is performed and the model being trained is saved. Defaults to 2.

- `--sw-bs` defines the batch size for sliding-window inference via `monai.inferers.sliding_window_inference` on the validation inputs. Defaults to 2.

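
The fold logic described for `--fold` can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in `trainddp.py`: the function `split_by_fold`, the `CaseID` column and the example rows are hypothetical, and only the `FoldID` column name comes from the description above.

```python
import csv
import io


def split_by_fold(rows, fold):
    """Split rows into (train, validation) lists based on their FoldID."""
    train = [row for row in rows if int(row["FoldID"]) != fold]
    valid = [row for row in rows if int(row["FoldID"]) == fold]
    return train, valid


# Made-up stand-in for train_filepaths.csv. The real file also lists the
# CT, PT and mask filepaths; their column names are not assumed here, and
# `CaseID` is a hypothetical column used only for this illustration.
example_csv = """CaseID,FoldID
case_a,0
case_b,1
case_c,2
case_d,3
case_e,4
case_f,0
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
train_rows, valid_rows = split_by_fold(rows, fold=0)
print(len(train_rows), "training rows,", len(valid_rows), "validation rows")
# prints "4 training rows, 2 validation rows"
```

Running the same split with a different `fold` value moves a different subset of rows into validation, which is how one CSV file can serve all five folds.
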

Alternatively, modify the [segmentation/train.sh](./../segmentation/train.sh) script (which contains the same command as above) for your use case and run:

```
bash train.sh
```
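
For reference, the learning-rate decay described for `--lr` can be written in closed form. The sketch below assumes the standard cosine annealing formula with a minimum learning rate of 0 and one scheduler step per epoch; whether the training script steps its scheduler exactly this way is an assumption, and `cosine_annealed_lr` is a hypothetical helper that is not part of this codebase.

```python
import math


def cosine_annealed_lr(initial_lr, epoch, total_epochs):
    """Learning rate after `epoch` epochs of cosine annealing down to 0."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Values taken from the example command: --lr=2e-4 and --epochs=500.
for epoch in (0, 125, 250, 375, 500):
    print(epoch, f"{cosine_annealed_lr(2e-4, epoch, 500):.2e}")
```

Under these assumptions, the rate starts at the initial value, passes through half of it at the midpoint, and reaches 0 at the final epoch.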