# Lymphoma Classification 2023
This is the repository for the modified and reduced code and the trained model LARS presented in the paper:

"_Deep learning for [18F]fluorodeoxyglucose-PET-CT classification in patients with lymphoma: a dual-centre retrospective analysis_",
by I. Häggström et al., published in The Lancet Digital Health (2023).

The code is for LARS-avg/max.

# Framework
The code is run using PyTorch.

# Description of files
* `train.py`: script to train the model (here set to 2d ResNet34). Give appropriate arguments to the function (use `--help`).
* `predict.py`: script that uses the trained model for inference on the test set. Give appropriate arguments to the function (use `--help`).
* `dataset.py`: script containing all functions for loading and handling images. Called by `train.py` and `predict.py`. Note that the images are assumed to be stored in a binary float32 format; this can of course be altered to fit one's own data (a minimal loading sketch is included after the "How to run" section below).
* `utils.py`: script containing the available models (here trimmed down to only ResNet34). Called by `train.py` and `predict.py`.
* `find_best_model.py`: script to create a csv file with a ranking of the best performing models (used to choose the top-10 ensemble). Give appropriate arguments to the function (use `--help`).
* `data.csv`: file with a dataframe containing all information about the image filenames and sizes, the binary target (0 or 1), as well as the ensemble data splits. The file should contain (at least) the columns below, i.e., one column for each of the N bootstrap data splits, named _split0_ ... _splitN_, holding the allocated split _train_, _val_, or _test_. In the example below, each scan has one coronal and one sagittal image (which of course should have the same target and the same data split allocations).
```
 df =
   scan_id  filename         target  matrix_size_1  matrix_size_2  split0  split1  ...  splitN
   0        image_0_cor.bin  0       250            250            train   train        val
   0        image_0_sag.bin  0       180            250            train   train        val
   1        image_1_cor.bin  1       234            210            train   train        train
   1        image_1_sag.bin  1       140            210            train   train        train
   2        image_2_cor.bin  1       245            199            train   val          train
   2        image_2_sag.bin  1       120            199            train   val          train
   3        image_3_cor.bin  0       189            249            test    test         test
   3        image_3_sag.bin  0       150            249            test    test         test
   ...      ...              ...     ...            ...            ...     ...          ...
   M        image_M_cor.bin  0       201            236            val     train        train
   M        image_M_sag.bin  0       120            236            val     train        train
```
* `convergence_split*_run*.csv`: file containing the per-epoch training log (epoch, loss, AUC, ...). Will be created upon running `train.py`.
* `checkpoint_split*_run*.pth`: file containing the checkpoint data (state_dict etc.) of your model. Will be created upon running `train.py`, and overwritten with the latest epoch every iteration.
* `pred_split*_run*.csv`: file containing the inference model predictions. Will be created when running `predict.py`.
* `best_run.csv`: file containing the ranked results (from the convergence files) for all N trained models. Will be created when running `find_best_model.py`.

# How to run
1. Create a dataframe according to the `data.csv` template above (you can name the file however you want).
2. Update the path to your image files and the dataframe filename at the top of `dataset.py`.
3. Start model training by running `>> python train.py <your arguments>`. This will save the convergence files `convergence_split*_run*.csv` and the model checkpoints `checkpoint_split*_run*.pth` in the output folder you specified. Run training on all your N data splits (to create the ensemble).
4. After finishing training, run `>> python find_best_model.py <your arguments>` to rank your N trained models. This will create the file `best_run.csv` in the same folder as the convergence and checkpoint files.
5. Run inference on the test set with `>> python predict.py <your arguments>`. You decide which of your N models to run, and the function uses the file `best_run.csv` from the previous step. This will save the prediction file `pred_split*_run*.csv` in your specified output folder.
6. Analyze the predictions by grouping them on `scan_id` and aggregating by mean or max (LARS-avg or LARS-max), then averaging the final result over the top-10 models (see the aggregation sketch below).
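For step 6, the following is a minimal sketch of the LARS-avg/LARS-max aggregation. The `output/` folder and the per-image probability column name `pred` are assumptions made for illustration; adjust them to the output folder you specified and to the columns that `predict.py` actually writes.

```python
# Minimal sketch of step 6 (LARS-avg / LARS-max + top-10 ensemble averaging).
# Assumptions: the prediction csv files live in output/ and contain a scan_id
# column plus a per-image probability column called "pred" -- adjust both to
# match your own prediction files.
import glob

import pandas as pd

pred_files = sorted(glob.glob("output/pred_split*_run*.csv"))  # your top-10 prediction files

per_model = []
for f in pred_files:
    df = pd.read_csv(f)
    # Aggregate the coronal + sagittal predictions of each scan:
    # mean -> LARS-avg, max -> LARS-max.
    per_scan = df.groupby("scan_id")["pred"].mean()  # use .max() for LARS-max
    per_model.append(per_scan)

# Average the per-scan probabilities over the ensemble members.
ensemble_pred = pd.concat(per_model, axis=1).mean(axis=1)
print(ensemble_pred.head())
```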
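For reference, here is a minimal sketch of how the `data.csv` dataframe and a binary float32 image (see `dataset.py` and the `data.csv` description above) can be read. The image folder path and the row-major `(matrix_size_1, matrix_size_2)` reshape order are assumptions; `dataset.py` contains the actual loading code used by `train.py` and `predict.py`.

```python
# Sketch of reading data.csv and one binary float32 image.
# Assumptions: IMAGE_DIR and the (matrix_size_1, matrix_size_2) reshape order
# are placeholders -- see dataset.py for the actual loading code.
import numpy as np
import pandas as pd

IMAGE_DIR = "/path/to/images"   # hypothetical image folder
df = pd.read_csv("data.csv")

row = df.iloc[0]                # e.g. image_0_cor.bin
img = np.fromfile(f"{IMAGE_DIR}/{row.filename}", dtype=np.float32)
img = img.reshape(int(row.matrix_size_1), int(row.matrix_size_2))

# Select the training images of the first bootstrap split.
train_rows = df[df["split0"] == "train"]
print(img.shape, len(train_rows))
```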
# Trained models
Downloadable checkpoints of the top-10 trained models are found here: [link](https://drive.google.com/drive/folders/1V-hhATi3zaqAiVyZ8_hgE3zhtSdt2HbV?usp=sharing).

# Cite this work
Cite this work using: [BibTeX](Häggström2023.bib)

# License for use
The IP rights of the model are owned by Memorial Sloan Kettering Cancer Center, New York, NY, USA.\
Sharing is done under the Creative Commons Attribution-NonCommercial 4.0 International license, as seen in the file [LICENSE-CC-BY-NC-4.0.md](LICENSE-CC-BY-NC-4.0.md).
Read more at https://creativecommons.org/licenses/by-nc/4.0/.