.. currentmodule:: slideflow.dataset

.. _datasets_and_validation:

Datasets
========

Working with large-scale imaging data can be both challenging and messy, so Slideflow provides the :class:`Dataset` class to assist with managing, splitting, filtering, and transforming your data for easy downstream use. A :class:`Dataset` organizes a set of image tiles extracted at a specific size, along with their associated slides and clinical annotations. Datasets are used by many Slideflow functions and can quickly generate ``torch.utils.data.DataLoader`` and ``tf.data.Dataset`` objects that provide preprocessed slide images for external applications.

Dataset Sources
***************

Datasets are composed of one or more *sources*: sets of slides, Regions of Interest (if available), and any tiles extracted from these slides. You might choose to organize your data into separate sources if slides are stored in distinct locations on disk - for example, if you are using multiple sets of slides from different institutions, with data from each institution stored separately.

Loading a Dataset
*****************

Datasets can be created either from a :ref:`Project <project_setup>` - using the project's dataset configuration file - or directly by providing paths to slides, annotations, and image tile destinations. In the next sections, we'll take a look at how to create a :class:`Dataset` with each method.

From a project
--------------

If you are working in the context of a :ref:`Project <project_setup>`, a dataset can be quickly created using :meth:`Project.dataset`. A dataset can be loaded from a given ``Project`` with the following parameters:

- ``tile_px`` is the tile size, in pixels
- ``tile_um`` is the tile size, in microns (``int``) or magnification (``'40x'``)
- ``sources`` is an optional list of dataset sources to use

.. code-block:: python

    import slideflow as sf

    P = sf.load_project('/project/path')
    dataset = P.dataset(tile_px=299, tile_um='10x', sources=['Source1'])

If ``sources`` is not provided, all available sources will be used.

Alternatively, you can accomplish the same by creating a :class:`Dataset` object directly, passing in the project :ref:`dataset configuration file <dataset_sources>` to the ``config`` argument, and a path to the annotations file to ``annotations``:

.. code-block:: python

    dataset = sf.Dataset(
        config='config.json',
        sources=['Source1'],
        annotations='annotations.csv',
        tile_px=299,
        tile_um='10x'
    )

Manually from paths
-------------------

You can also create a dataset by manually supplying paths to slides, a destination for image tiles, and clinical annotations. A single dataset source will be created from the provided arguments, which include:

- ``tile_px`` is the tile size, in pixels
- ``tile_um`` is the tile size, in microns or magnification
- ``slides`` is the directory containing whole-slide images
- ``roi`` is the directory containing Regions of Interest \*.csv files
- ``tfrecords`` is the path where image tiles should be stored in TFRecord format
- ``tiles`` is the path where image tiles should be stored as \*.jpg images
- ``annotations`` is either an annotations file (CSV) or a Pandas DataFrame

For example, to create a dataset from a set of slides, with a configured TFRecord directory and annotations provided via Pandas DataFrame:

.. code-block:: python

    import pandas as pd

    # Create some clinical annotations
    df = pd.DataFrame(...)

    # Create a dataset
    dataset = sf.Dataset(
        slides='/slides',
        tfrecords='/tfrecords',
        annotations=df,
        tile_px=299,
        tile_um='10x'
    )

When creating a :class:`Dataset` manually from paths, TFRecords should be organized into subdirectories named according to the tile size. Using the above example, the tfrecords directory should look like:

.. code-block:: none

    /tfrecords
    └── 299px_10x
        ├── slide1.tfrecords
        ├── slide2.tfrecords
        ├── slide3.tfrecords
        └── ...
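
The subdirectory label combines the pixel size with the micron or magnification size. A minimal sketch of this naming convention (the helper function is hypothetical, for illustration only; the ``302um``-style label for integer micron sizes is an assumption):

.. code-block:: python

    def tfrecord_label(tile_px, tile_um):
        """Build the expected subdirectory name for a given tile size.

        Hypothetical helper: magnification sizes are strings (e.g. '10x'),
        while micron sizes are integers (assumed to render as e.g. '302um').
        """
        if isinstance(tile_um, str):
            return f"{tile_px}px_{tile_um.lower()}"
        return f"{tile_px}px_{tile_um}um"

    print(tfrecord_label(299, '10x'))  # -> 299px_10x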

Filtering
*********

Datasets can be filtered through several mechanisms:

- **filters**: A dictionary, where keys are clinical annotation headers and values are lists of variable states to include. All remaining slides are removed from the dataset.
- **filter_blank**: A list of headers; any slide with a blank value in one of these columns in the clinical annotations will be excluded.
- **min_tiles**: An ``int``; any tfrecords with fewer than this number of tiles will be excluded.

Filters can be provided at the time of Dataset creation by passing them to the initializer:

.. code-block:: python

    dataset = Dataset(..., filters={'HPV_status': ['negative', 'positive']})

or by using the :meth:`Dataset.filter` method:

.. code-block:: python

    dataset = dataset.filter(min_tiles=50)
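
The semantics of ``filters`` can be pictured as keeping only slides whose annotation value for each filtered column appears in the allowed list. A pure-Python sketch of the idea (not Slideflow's implementation; the column name is illustrative):

.. code-block:: python

    def apply_filters(rows, filters):
        """Keep rows whose value for every filtered column is in its allowed list."""
        return [
            row for row in rows
            if all(row.get(col) in allowed for col, allowed in filters.items())
        ]

    rows = [
        {'slide': 'slide1', 'HPV_status': 'positive'},
        {'slide': 'slide2', 'HPV_status': ''},          # blank value: excluded
        {'slide': 'slide3', 'HPV_status': 'negative'},
    ]
    kept = apply_filters(rows, {'HPV_status': ['negative', 'positive']})
    # kept contains slide1 and slide3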

Dataset Manipulation
********************

A number of functions can be applied to Datasets to manipulate patient filters (:meth:`Dataset.filter`, :meth:`Dataset.remove_filter`, :meth:`Dataset.clear_filters`), clip tfrecords to a maximum number of tiles (:meth:`Dataset.clip`), or prepare mini-batch balancing (:meth:`Dataset.balance`). The full documentation for these functions is given :ref:`in the API <dataset>`. Each of these manipulations returns an altered copy of the dataset for easy chaining:

.. code-block:: python

    dataset = dataset.balance('HPV_status').clip(50)

Each of these manipulations is performed in memory and will not affect data stored on disk.
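
Category-level balancing can be understood as weighting each slide inversely to its category's prevalence, so that minority categories are sampled as often as majority ones during training. A pure-Python sketch of this idea (not Slideflow's implementation; slide names and labels are illustrative):

.. code-block:: python

    from collections import Counter

    def balance_weights(slide_labels):
        """Assign each slide a sampling weight inversely proportional to
        the prevalence of its category, normalized to sum to 1."""
        counts = Counter(slide_labels.values())
        raw = {slide: 1.0 / counts[label] for slide, label in slide_labels.items()}
        total = sum(raw.values())
        return {slide: w / total for slide, w in raw.items()}

    weights = balance_weights({'s1': 'HPV+', 's2': 'HPV+', 's3': 'HPV-'})
    # The single 'HPV-' slide is weighted twice as heavily as each 'HPV+' slide.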

Dataset Inspection
******************

The fastest way to inspect a :class:`Dataset`, including the dataset sources loaded, the number of slides found, the clinical annotation columns available, and the number of tiles extracted into TFRecords, is the :meth:`Dataset.summary` method.

.. code-block:: python

    dataset.summary()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Overview:
    ╒═════════════════════╤═════════════════════════╕
    │ Configuration file: │ /mnt/data/datasets.json │
    │ Tile size (px):     │ 299                     │
    │ Tile size (um):     │ 10x                     │
    │ Slides:             │ 941                     │
    │ Patients:           │ 941                     │
    │ Slides with ROIs:   │ 941                     │
    │ Patients with ROIs: │ 941                     │
    ╘═════════════════════╧═════════════════════════╛

    Filters:
    ╒═══════════════╤════╕
    │ Filters:      │ {} │
    ├───────────────┼────┤
    │ Filter Blank: │ [] │
    ├───────────────┼────┤
    │ Min Tiles:    │ 0  │
    ╘═══════════════╧════╛

    Sources:

    TCGA_LUNG
    ╒═══════════╤══════════════════════════════════╕
    │ slides    │ /mnt/raid/SLIDES/TCGA_LUNG       │
    │ roi       │ /mnt/raid/SLIDES/TCGA_LUNG       │
    │ tiles     │ /mnt/rocket/tiles/TCGA_LUNG      │
    │ tfrecords │ /mnt/rocket/tfrecords/TCGA_LUNG/ │
    │ label     │ 299px_10x                        │
    ╘═══════════╧══════════════════════════════════╛

    Number of tiles in TFRecords: 18354
    Annotation columns:
    Index(['patient', 'subtype', 'site', 'slide'],
        dtype='object')

Manifest
********

:meth:`Dataset.manifest` provides a dictionary mapping tfrecords to the total number of image tiles and the number of tiles after clipping or mini-batch balancing. For example, after clipping:

.. code-block:: python

    dataset = dataset.clip(500)

the manifest may look something like:

.. code-block:: json

    {
        "/path/tfrecord1.tfrecords":
        {
            "total": 1526,
            "clipped": 500
        },
        "/path/tfrecord2.tfrecords":
        {
            "total": 455,
            "clipped": 455
        }
    }

Inspecting a dataset's manifest may be useful to better understand the effects of dataset manipulations.
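
The effect of clipping on the manifest amounts to capping each TFRecord's tile count at the clip value. A pure-Python sketch of that bookkeeping (not Slideflow's implementation):

.. code-block:: python

    def clipped_manifest(tile_counts, clip):
        """Report total and post-clip tile counts for each TFRecord path."""
        return {
            path: {'total': n, 'clipped': min(n, clip)}
            for path, n in tile_counts.items()
        }

    manifest = clipped_manifest(
        {'/path/tfrecord1.tfrecords': 1526, '/path/tfrecord2.tfrecords': 455},
        clip=500,
    )
    # tfrecord1 is clipped to 500 tiles; tfrecord2 is unchanged.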

.. _validation_planning:

Training/Validation Splitting
*****************************

An important step when planning an experiment is to determine your validation and testing data. In total, deep learning experiments should have three groups of data:

1) **Training** - data used for learning during training
2) **Validation** - data used for validating training parameters and early stopping (if applicable)
3) **Evaluation** - held-out data used for final testing once all training and parameter tuning has completed. Preferably an external cohort.

|

Slideflow includes tools for flexible training, validation, and evaluation data planning as discussed in the next sections.

Creating a split
----------------

Datasets can be split into training and validation or test datasets with :meth:`Dataset.split`. This function returns two datasets, the first for training and the second for validation, each a separate instance of :class:`Dataset`.

Slideflow provides several options for preparing a validation plan, including:

- **strategy**: ``'bootstrap'``, ``'k-fold'``, ``'k-fold-manual'``, ``'k-fold-preserved-site'``, ``'fixed'``, or ``'none'``
- **fraction**: a ``float`` between 0 and 1 (not used for k-fold validation)
- **k_fold**: an ``int``

The default validation strategy is three-fold cross-validation (``strategy='k-fold'`` and ``k_fold=3``).

.. code-block:: python

    # Split a dataset into training and validation
    # using 5-fold cross-validation, with this being
    # the first cross-fold.
    train_dataset, val_dataset = dataset.split(
        model_type='classification', # Categorical labels
        labels='subtype',            # Label to balance between datasets
        k_fold=5,                    # Total number of crossfolds
        k_fold_iter=1,               # Cross-fold iteration
        splits='splits.json'         # Where to save/load crossfold splits
    )

You can also use :meth:`Dataset.kfold_split` to iterate through cross-fold splits:

.. code-block:: python

    # Split a dataset into training and validation
    # using 5-fold cross-validation
    for train, test in dataset.kfold_split(k=5, labels='subtype'):
        ...
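
Under the hood, any k-fold scheme partitions the slides into *k* non-overlapping groups, holding out one group per iteration. A pure-Python sketch of the partitioning idea (not Slideflow's implementation; slide names are illustrative):

.. code-block:: python

    import random

    def kfold_partition(slides, k=3, seed=42):
        """Shuffle slides deterministically and split into k non-overlapping folds."""
        shuffled = sorted(slides)
        random.Random(seed).shuffle(shuffled)
        return [shuffled[i::k] for i in range(k)]

    folds = kfold_partition([f'slide{i}' for i in range(10)], k=5)

    # Each fold serves once as validation; the remaining folds form the training set.
    val_fold = folds[0]
    train = [s for fold in folds[1:] for s in fold]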

.. _validation_strategies:

Validation strategies
---------------------

.. figure:: validation.png
    :width: 100%
    :align: center

The ``strategy`` option determines how the validation data is selected.

If **fixed**, a fixed percentage of your training data is set aside for validation (determined by ``fraction``).

If **bootstrap**, validation data will be selected at random (with the percentage determined by ``fraction``), and all training iterations will be repeated a number of times equal to ``k_fold``. When used during training, the reported model training metrics will be an average of all bootstrap iterations.

If **k-fold**, training data will be automatically separated into *k* groups (where *k* is equal to ``k_fold``), and all training iterations will be repeated *k* times using k-fold cross-validation. The saved and reported model training metrics will be an average of all k-fold iterations.

Datasets can be separated into manually-curated k-folds using the **k-fold-manual** strategy. Assign each slide to a k-fold cohort in the annotations file, and designate the appropriate column header with ``k_fold_header``.

The **k-fold-preserved-site** strategy is a cross-validation strategy that ensures site is preserved across the training/validation sets, in order to reduce bias from batch effect as described by `Howard, et al <https://www.nature.com/articles/s41467-021-24698-1>`_. This strategy is recommended when using data from The Cancer Genome Atlas (`TCGA <https://portal.gdc.cancer.gov/>`_).

.. note::
    Preserved-site cross-validation requires either `CPLEX <https://www.ibm.com/analytics/cplex-optimizer>`_ or `Pyomo/Bonmin <https://anaconda.org/conda-forge/coinbonmin>`_. The original implementation of the preserved-site cross-validation algorithm described by Howard et al can be found `on GitHub <https://github.com/fmhoward/PreservedSiteCV>`_.

If **none**, no validation testing will be performed.

Re-using splits
---------------

For all validation strategies, training/validation splits can be logged to a JSON file automatically if a splits configuration file is provided to the ``splits`` argument. When provided, :meth:`Dataset.split` will prioritize previously generated training/validation splits rather than generating a new split. This aids with experiment reproducibility and hyperparameter tuning. If training/validation splits are being prepared by a :ref:`Project-level function <project>`, splits will be automatically logged to a ``splits.json`` file in the project root directory.
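
The reuse logic amounts to loading a saved split when one exists, and creating and logging a new one otherwise. A pure-Python sketch of the pattern (the JSON layout shown is illustrative, not Slideflow's actual schema):

.. code-block:: python

    import json
    import os
    import tempfile

    def load_or_create_split(path, create_fn):
        """Reuse a previously saved split if present; otherwise create and log one."""
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        split = create_fn()
        with open(path, 'w') as f:
            json.dump(split, f, indent=2)
        return split

    path = os.path.join(tempfile.mkdtemp(), 'splits.json')
    first = load_or_create_split(path, lambda: {'train': ['slide1'], 'val': ['slide2']})
    second = load_or_create_split(path, lambda: {'train': [], 'val': []})
    # second is loaded from disk, identical to first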

Creating Dataloaders
********************

Finally, Datasets can also return either a ``tf.data.Dataset`` or a ``torch.utils.data.DataLoader`` object, ready to be used as model input, via the :meth:`Dataset.tensorflow` and :meth:`Dataset.torch` methods, respectively. See :ref:`dataloaders` for more detailed information and examples.

Datasets have many other utility functions for working with and processing data. Read more in the :ref:`Dataset API documentation <dataset>`.