.. _filtering:

Slide Processing
================

.. image:: tile_extraction_overview.png

|

Whole-slide histopathological images present many challenges for machine learning researchers: these large gigapixel images may contain out-of-focus regions, pen marks, uneven staining, or varying optical resolutions. Slideflow provides flexible and computationally efficient slide processing tools for building datasets ready for machine learning applications.

Most tools in Slideflow work with image tiles - extracted sub-regions of a whole-slide image - as the primary data source. For efficiency, image tiles are first buffered into :ref:`TFRecords <tfrecords>`, a binary file format that greatly improves IO throughput. Although training can be performed without TFRecords (see :ref:`from_wsi`), we recommend tile extraction as the first step for most projects.

Tile extraction
***************

Image tiles are extracted from whole-slide images using either :meth:`slideflow.Project.extract_tiles` or :meth:`slideflow.Dataset.extract_tiles`. When using the Project interface, the only required arguments are ``tile_px`` and ``tile_um``, which determine the size of the extracted image tiles in pixels and microns:

.. code-block:: python

    P.extract_tiles(tile_px=299, tile_um=302)

When using a :class:`slideflow.Dataset`, no arguments are required:

.. code-block:: python

    dataset.extract_tiles()

Tiles will be extracted at the specified pixel and micron size and stored in TFRecord format. Loose image tiles (\*.jpg or \*.png format) can also be saved with the argument ``save_tiles=True``.

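For example, to store tiles in TFRecords while also saving loose JPG images:

.. code-block:: python

    dataset.extract_tiles(save_tiles=True)
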
See the :meth:`slideflow.Dataset.extract_tiles` API documentation for customization options.

.. note::

    Slide scanners may have differing microns-per-pixel (MPP) resolutions, so "10X" magnification from one scanner may differ slightly from "10X" on another. Specifying a fixed ``tile_um`` ensures all image tiles have both the same pixel size and micron size. This MPP-harmonization step uses the `Libvips resize <https://www.libvips.org/API/current/libvips-resample.html#vips-resize>`_ function on extracted images. To disable this step and instead extract tiles at a given `downsample layer <https://dicom.nema.org/dicom/dicomwsi/>`_, set ``tile_um`` to a magnification level rather than a micron size:

    .. code-block:: python

        P.extract_tiles(tile_px=299, tile_um="10x")

Cell segmentation
*****************

An alternative to extracting tiles in a grid across whole-slide images is extracting tiles at detected cell centroids. This is discussed separately in :ref:`cellseg`.

.. _regions_of_interest:

Regions of Interest
*******************

Tile extraction can optionally be restricted to pathologist-annotated Regions of Interest (ROIs), allowing you to enrich your dataset by using only the relevant sections of a slide.

We offer two methods for annotating ROIs - :ref:`Slideflow Studio <studio_roi>` and `QuPath <https://qupath.github.io/>`_. Please see the Slideflow Studio section for instructions on generating ROI annotations using the Slideflow interface.

If you are using QuPath, annotate whole-slide images using the Polygon tool. Then, click **Automate** -> **Show script editor**. In the script editor, click **File** -> **Open** and load the ``qupath_roi.groovy`` script (QuPath 0.2 or greater) or ``qupath_roi_legacy.groovy`` (QuPath 0.1.x); both scripts are `available on GitHub <https://github.com/slideflow/slideflow>`_. Click **Run** -> **Run** if using QuPath 0.2 or greater, or **Run** -> **Run for Project** if using QuPath 0.1.x. ROIs will be exported in CSV format in the QuPath project directory, in the subdirectory "ROI".

Once ROI CSV files are generated, ensure they are placed in the folder expected by your :ref:`Project <project_setup>` or :ref:`Dataset <datasets_and_validation>`, based on their respective configurations.

The ``roi_method`` argument to the ``extract_tiles()`` functions allows you to control how ROIs are used. Options include:

- ``'auto'``: Default behavior. For slides with a valid ROI, extract tiles from within ROIs only. For slides without ROIs, extract from the whole-slide image.
- ``'inside'``: Extract from within ROIs, and skip any slides missing ROIs.
- ``'outside'``: Extract from outside ROIs, and skip any slides missing ROIs.
- ``'ignore'``: Ignore all ROIs, extracting from whole-slide images.

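For example, to extract tiles only from annotated regions, skipping any slides that lack ROIs:

.. code-block:: python

    P.extract_tiles(tile_px=299, tile_um=302, roi_method='inside')
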
.. note::

    Nested ROIs will be rendered as holes.

By default, ROIs filter tiles based on the center point of the tile. Alternatively, you can filter tiles based on the proportion of the tile inside an ROI using the argument ``roi_filter_method``. If ``roi_filter_method`` is set to a float (0-1), this value is interpreted as a proportion threshold: a tile is included if the proportion of the tile inside an ROI is greater than this number. For example, if ``roi_filter_method=0.7``, a tile that is 80% inside of an ROI will be included, but a tile that is only 60% inside of an ROI will be excluded.

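As a sketch, combining ``roi_method`` and ``roi_filter_method``:

.. code-block:: python

    # Keep only tiles that are more than 70% inside an ROI,
    # skipping any slides without ROIs
    P.extract_tiles(
        tile_px=299,
        tile_um=302,
        roi_method='inside',
        roi_filter_method=0.7
    )
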
.. image:: roi_filter.jpg

|

.. _roi_labels:

ROIs can optionally be assigned a label. Labels can be added or changed using :ref:`Slideflow Studio <studio_roi>`, or by adding a "label" column in the ROI CSV file. Labels can be used to train strongly supervised models, where each tile is assigned a label based on the ROI it is extracted from, rather than inheriting the label of the whole-slide image. See the developer note :ref:`tile_labels` for more information.

To retrieve the ROI name (and label, if present) for all tiles in a slide, use :meth:`slideflow.WSI.get_tile_dataframe`. This will return a Pandas DataFrame with the following columns:

    - **loc_x**: X-coordinate of the tile center
    - **loc_y**: Y-coordinate of the tile center
    - **grid_x**: X grid index of the tile
    - **grid_y**: Y grid index of the tile
    - **roi_name**: Name of the ROI, if the tile is in an ROI, else None
    - **roi_desc**: Description of the ROI, if the tile is in an ROI, else None
    - **label**: ROI label, if present

The **loc_x** and **loc_y** columns contain the same tile location information :ref:`stored in TFRecords <tfrecords>`.

You can also retrieve this information for all slides in a dataset by using :meth:`slideflow.Dataset.get_tile_dataframe`, which will return a DataFrame with the same columns as above, plus a ``slide`` column.

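For example, a minimal sketch of building a per-tile dataframe for a single slide (the paths are hypothetical, and we assume the ``rois`` argument accepts ROI CSV paths):

.. code-block:: python

    import slideflow as sf

    # Load a slide along with its ROI annotations
    wsi = sf.WSI(
        '/path/to/slide.svs',
        tile_px=299,
        tile_um=302,
        rois=['/path/to/roi.csv']
    )

    # Per-tile dataframe, with ROI names and labels where present
    df = wsi.get_tile_dataframe()

    # Keep only tiles that fall within an ROI
    tiles_in_rois = df.loc[df.roi_name.notnull()]
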
Masking & Filtering
*******************

Slideflow provides two approaches for refining where image tiles should be extracted from whole-slide images: **slide-level masking** and **tile-level filtering**. In the next sections, we'll review options for both approaches.

Otsu's thresholding
-------------------

.. image:: otsu.png

|

Otsu's thresholding is a **slide-based method** that distinguishes foreground (tissue) from background (empty slide). Otsu's thresholding is performed in the HSV colorspace and yields similar results to grayspace filtering, a tile-level filtering method described below.

To apply Otsu's thresholding to slides before tile extraction, use the ``qc`` argument of the ``.extract_tiles()`` functions:

.. code-block:: python

  from slideflow.slide import qc

  # Use this QC during tile extraction
  P.extract_tiles(qc=qc.Otsu())

You can also apply Otsu's thresholding to a single slide with the :meth:`slideflow.WSI.qc` method. See :class:`the WSI API documentation <slideflow.WSI>` for more information on working with individual slides.

.. code-block:: python

  from slideflow.slide import qc

  # Apply Otsu's thresholding to a WSI object
  wsi = sf.WSI(...)
  wsi.qc(qc.Otsu()).show()

Gaussian blur filtering
-----------------------

.. image:: blur.png

|

Gaussian blur masking is another **slide-based method** that can detect pen marks and out-of-focus areas, and is particularly useful for datasets lacking annotated Regions of Interest (ROIs). Gaussian blur masking is applied similarly, using the ``qc`` argument.

Two versions of Gaussian blur masking are available: ``qc.Gaussian`` and ``qc.GaussianV2`` (new in Slideflow 2.1.0). The latter is the default and recommended version, as it is more computationally efficient; the former is provided for backwards compatibility.

.. code-block:: python

  from slideflow.slide import qc

  # Use this QC during tile extraction
  P.extract_tiles(qc=qc.GaussianV2())

By default, Gaussian blur masking is calculated at 4 times lower magnification than the tile extraction MPP (e.g., when extracting tiles at 10X effective magnification, Gaussian filtering is calculated at 2.5X). This reduces computation time. You can change this behavior by manually setting the ``mpp`` argument to a specific microns-per-pixel value.

Gaussian blur masking is performed on grayscale images. The ``sigma`` argument controls the standard deviation of the Gaussian blur kernel. The default value of 3 is recommended, but you may need to adjust this value for your dataset. A higher value will result in more areas being masked, while a lower value will result in fewer areas being masked.

.. code-block:: python

  from slideflow.slide import qc

  # Customize the Gaussian filter,
  # using a sigma of 2 and an MPP of 1 (10X magnification)
  gaussian = qc.GaussianV2(mpp=1, sigma=2)

You can also use multiple slide-level masking methods by providing a list to ``qc``.

.. code-block:: python

  from slideflow.slide import qc

  # Use a distinct name for the list, so the imported
  # ``qc`` module is not shadowed
  qc_methods = [
    qc.Otsu(),
    qc.Gaussian()
  ]
  P.extract_tiles(qc=qc_methods)

If both Otsu's thresholding and blur detection are used, Slideflow will calculate Blur Burden, a metric that assesses the degree to which non-background tiles are either out-of-focus or contain artifacts. In the tile extraction PDF report (see Extraction reports, below), the distribution of blur burden for the slides in the dataset is plotted on the first page, and the report will flag any slide whose blur burden exceeds 5%. A text file containing the names of slides with high blur burden will be saved in the exported TFRecords directory. These slides should be manually reviewed to ensure they are of high enough quality to include in the dataset.

DeepFocus
---------

Slideflow also provides an interface for using `DeepFocus <https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0205387&type=printable>`_ to identify in-focus regions. DeepFocus is a lightweight neural network that predicts whether a section of a slide is in- or out-of-focus. When used as a slide-level masking method, DeepFocus will filter out-of-focus tiles from a slide. By default, DeepFocus is applied to slides at 40X magnification, although this can be customized with the ``tile_um`` argument.

.. code-block:: python

    from slideflow.slide import qc

    deepfocus = qc.DeepFocus(tile_um='20x')
    slide.qc(deepfocus)

Alternatively, you can retrieve raw predictions from the DeepFocus model for a slide by calling the deepfocus object on a :class:`slideflow.WSI` object, passing the argument ``threshold=False``:

.. code-block:: python

    preds = deepfocus(slide, threshold=False)

Custom deep learning QC
-----------------------

You can also create your own deep learning slide filters. To create a custom deep learning QC method like DeepFocus, create a custom slide filter that inherits from :class:`slideflow.slide.qc.StridedDL`. For example, to manually recreate the above DeepFocus model, first clone the `TF2 fork on GitHub <https://github.com/jamesdolezal/deepfocus>`_, which contains the DeepFocus architecture and model weights, and then create a custom class as below:

.. code-block:: python

    from slideflow.slide.qc import strided_dl
    from deepfocus.keras_model import load_checkpoint, deepfocus_v3

    class CustomDeepFocus(strided_dl.StridedDL):

        def __init__(self):
            model = deepfocus_v3()
            checkpoint = '/path/to/deepfocus/checkpoints/ver5'
            load_checkpoint(model, checkpoint)
            super().__init__(
                model=model,
                pred_idx=1,
                tile_px=64,
                tile_um='40x'
            )

Then, supply this class to the ``qc`` argument as above.

.. code-block:: python

  P.extract_tiles(qc=CustomDeepFocus())

See :ref:`qc` for more information on the API for further QC customization.

Segmentation Models (U-Net)
---------------------------

Slideflow also provides an interface for both training and using segmentation models (e.g. U-Net, FPN, DeepLabV3) for slide-level masking. This is discussed separately in :ref:`segmentation`.

Grayspace filtering
-------------------

Grayspace filtering is a **tile-based method** that detects the amount of grayspace in a given image tile and discards the tile if the gray content exceeds a set threshold. RGB image tiles are converted to the HSV colorspace, and the fraction of pixels with saturation below a certain threshold is calculated. This filtering is performed separately for each tile as it is being extracted. Relevant arguments for grayspace filtering include:

- ``grayspace_threshold``: Saturation value, below which a pixel is considered gray. Range 0-1. Defaults to 0.05.
- ``grayspace_fraction``: Image tiles with grayspace above this fraction will be discarded. Defaults to 0.6.

Grayspace filtering is enabled by default, and can be disabled by passing ``grayspace_fraction=1`` to the ``.extract_tiles()`` functions.

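For example, to apply stricter grayspace filtering, discarding tiles that are more than 40% gray:

.. code-block:: python

    P.extract_tiles(
        tile_px=299,
        tile_um=302,
        grayspace_fraction=0.4
    )
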
Grayspace filtering is similar to Otsu's thresholding, with both operating in the HSV colorspace. Otsu's thresholding is ~30% faster than grayspace filtering for slides with accessible downsample layers, but if downsample layers are not stored in a given slide or are inaccessible (e.g. ``enable_downsample=False``), grayspace filtering may be faster. Grayspace filtering is more reliable than Otsu's thresholding for slides with abundant pen marks or other artifacts, which can present issues for Otsu's thresholding.

Whitespace filtering
--------------------

Whitespace filtering is performed similarly to grayspace filtering. Whitespace is calculated from the overall brightness of each pixel, counting the fraction of pixels with a brightness above some threshold. As with grayspace filtering, there are two relevant arguments:

- ``whitespace_threshold``: Brightness value, above which a pixel is considered white. Range 0-255. Defaults to 230.
- ``whitespace_fraction``: Image tiles with whitespace above this fraction will be discarded. Defaults to 1.0 (disabled).

Whitespace filtering is disabled by default.

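For example, to enable whitespace filtering and discard tiles that are more than 90% white:

.. code-block:: python

    P.extract_tiles(
        tile_px=299,
        tile_um=302,
        whitespace_fraction=0.9
    )
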
Stain normalization
*******************

.. image:: norm_compare/wsi_norm_compare.jpg

Image tiles can undergo digital Hematoxylin and Eosin (H&E) stain normalization either during tile extraction or in real-time during training. Real-time normalization adds CPU overhead during training and inference but offers greater flexibility, allowing you to test different normalization strategies without re-extracting tiles from your entire dataset.

Available stain normalization algorithms include:

- **macenko**: `Original Macenko paper <https://www.cs.unc.edu/~mn/sites/default/files/macenko2009.pdf>`_.
- **macenko_fast**: Modified Macenko algorithm with the brightness standardization step removed.
- **reinhard**: `Original Reinhard paper <https://ieeexplore.ieee.org/document/946629>`_.
- **reinhard_fast**: Modified Reinhard algorithm with the brightness standardization step removed.
- **reinhard_mask**: Modified Reinhard algorithm, with background/whitespace removed.
- **reinhard_fast_mask**: Modified Reinhard-Fast algorithm, with background/whitespace removed.
- **vahadane**: `Original Vahadane paper <https://ieeexplore.ieee.org/document/7460968>`_.
- **augment**: HSV colorspace augmentation.
- **cyclegan**: CycleGAN-based stain normalization, as implemented by `Zingman et al <https://github.com/Boehringer-Ingelheim/stain-transfer>`_ (PyTorch only).

The Macenko and Reinhard stain normalizers are highly efficient, with native Tensorflow, PyTorch, and Numpy/OpenCV implementations, and support GPU acceleration (see :ref:`performance benchmarks <normalizer_performance>`).

During tile extraction
----------------------

Image tiles can be normalized during tile extraction using the ``normalizer`` and ``normalizer_source`` arguments. ``normalizer`` is the name of the algorithm. The normalizer source - either a path to a reference image, or a ``str`` indicating one of our presets (e.g. ``'v1'``, ``'v2'``, ``'v3'``) - can be set with ``normalizer_source``.

.. code-block:: python

    P.extract_tiles(
      tile_px=299,
      tile_um=302,
      normalizer='reinhard'
    )

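As a sketch, Macenko normalization with one of the preset references might look like:

.. code-block:: python

    P.extract_tiles(
      tile_px=299,
      tile_um=302,
      normalizer='macenko',
      normalizer_source='v2'
    )
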
:ref:`Contextual stain normalization <contextual_normalization>` is supported when normalizing during tile extraction.

On-the-fly
----------

The stain normalization implementations in Slideflow are fast and efficient, with separate Tensorflow-native, PyTorch-native, and Numpy/OpenCV implementations. In most instances, we recommend performing stain normalization on-the-fly as part of image pre-processing, as this provides flexibility for changing normalization strategies without re-extracting all of your image tiles.

Real-time normalization can be performed by setting the ``normalizer`` and/or ``normalizer_source`` hyperparameters.

.. code-block:: python

    from slideflow.model import ModelParams
    hp = ModelParams(..., normalizer='reinhard')

If a model was trained using a normalizer, the normalizer algorithm and fit information will be stored in the model metadata file, ``params.json``, in the saved model folder. Any Slideflow function that uses this model will automatically process images using the same normalization strategy.

When stain normalizing on-the-fly, stain augmentation becomes available as a training augmentation technique. Read more about :ref:`stain augmentation <stain_augmentation>`.

The normalizer interfaces can also be accessed directly through :class:`slideflow.norm.StainNormalizer`. See :py:mod:`slideflow.norm` for examples and more information.

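For example, a minimal sketch of direct normalizer use (``img`` is a hypothetical RGB image array, and we assume ``autoselect`` picks the most efficient available backend):

.. code-block:: python

    import slideflow as sf

    # Build a Reinhard normalizer
    normalizer = sf.norm.autoselect('reinhard')

    # Normalize an RGB image (H x W x 3, uint8)
    normalized = normalizer.transform(img)
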
Performance optimization
************************

As tile extraction is heavily reliant on random-access reads, significant performance gains can be achieved by either 1) moving all slides to an SSD, or 2) utilizing an SSD or ramdisk buffer (to which slides will be copied prior to extraction). The use of a ramdisk buffer can improve tile extraction speed by 10-fold or greater! To maximize performance, pass the buffer path to the argument ``buffer``.

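For example, assuming a ramdisk is mounted at the hypothetical path ``/mnt/ramdisk``:

.. code-block:: python

    P.extract_tiles(
        tile_px=299,
        tile_um=302,
        buffer='/mnt/ramdisk'
    )
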
Extraction reports
******************

Once tiles have been extracted, a PDF report will be generated with a summary and sample of tiles extracted from their corresponding slides. An example of such a report is given below. Reviewing this report may help you identify data corruption, stain normalization artifacts, or suboptimal background filtering. The report is saved in the TFRecords directory.

.. image:: example_report_small.jpg

In addition to viewing reports after tile extraction, you can generate new reports for existing TFRecords by calling :meth:`slideflow.Dataset.tfrecord_report` on a given dataset. For example:

.. code-block:: python

    dataset = P.dataset(tile_px=299, tile_um=302)
    dataset.tfrecord_report("/path/to/dest")

You can also generate reports for slides that have not yet been extracted by passing ``dry_run=True`` to :meth:`slideflow.Dataset.extract_tiles`.
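
For example:

.. code-block:: python

    # Generate an extraction report without saving tiles or TFRecords
    dataset.extract_tiles(dry_run=True)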