a b/docs-source/source/project_setup.rst

.. _project_setup:

Setting up a Project
====================

Slideflow :ref:`Projects <project>` organize datasets, annotations, and results into a unified directory and provide a high-level API for common tasks.

Use :func:`slideflow.create_project` to create a new project, supplying an annotations file (with patient labels) and the path to your slides. This configures a new dataset source (a collection of slides and tfrecords). Additional keyword arguments can be used to specify the location of tfrecords and saved models.

.. code-block:: python

    import slideflow as sf

    # Create a new project rooted at 'project_path'
    P = sf.create_project(
      root='project_path',
      annotations="./annotations.csv",
      slides='/path/to/slides/'
    )

Project settings are saved in a ``settings.json`` file in the root project directory. Each project has the following settings:

+-------------------------------+-------------------------------------------------------+
| **name**                      | Project name.                                         |
|                               | Defaults to "MyProject".                              |
+-------------------------------+-------------------------------------------------------+
| **annotations**               | Path to CSV containing annotations.                   |
|                               | Each line is a unique slide.                          |
|                               | Defaults to "./annotations.csv".                      |
+-------------------------------+-------------------------------------------------------+
| **dataset_config**            | Path to JSON file containing dataset configuration.   |
|                               | Defaults to "./datasets.json".                        |
+-------------------------------+-------------------------------------------------------+
| **sources**                   | Names of dataset source(s) to include in the project. |
|                               | Defaults to an empty list.                            |
+-------------------------------+-------------------------------------------------------+
| **models_dir**                | Path where model files and results are saved.         |
|                               | Defaults to "./models".                               |
+-------------------------------+-------------------------------------------------------+
| **eval_dir**                  | Path where model evaluation results are saved.        |
|                               | Defaults to "./eval".                                 |
+-------------------------------+-------------------------------------------------------+

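Because the settings are stored as plain JSON, they can be inspected without loading Slideflow. The following is a minimal sketch using only the Python standard library. The ``read_project_settings`` helper is hypothetical and not part of the Slideflow API; only the file name and key names come from the description above.

.. code-block:: python

    import json
    from pathlib import Path


    def read_project_settings(project_root):
        """Return the contents of ``settings.json`` as a dictionary.

        Hypothetical helper for illustration; not part of the Slideflow API.
        """
        settings_path = Path(project_root) / "settings.json"
        with settings_path.open() as f:
            return json.load(f)

For example, ``read_project_settings('/path/to/project/directory').get('models_dir')`` would return the configured models directory, or ``None`` if that key has not been written to the file.
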
Once a project has been initialized, you can load it from its root directory:

.. code-block:: python

    import slideflow as sf
    P = sf.load_project('/path/to/project/directory')

.. _dataset_sources:

Dataset Sources
***************

A :ref:`dataset source <datasets_and_validation>` is a collection of slides, Region of Interest (ROI) annotations (if available), and extracted tiles. Sources are defined in the project dataset configuration file, which can be shared and used across multiple projects or saved locally within a project directory. These configuration files have the following format:

.. code-block:: json

    {
      "SOURCE":
      {
        "slides": "/directory",
        "roi": "/directory",
        "tiles": "/directory",
        "tfrecords": "/directory"
      }
    }

When a project is created with :func:`slideflow.create_project`, a dataset source is automatically created. You can change where slides and extracted tiles are stored by editing the project's dataset configuration file.

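After editing a dataset configuration file by hand, a quick check can catch a misspelled or missing key. The sketch below reports, for each source, which of the four keys shown in the format above are absent. The ``missing_source_keys`` helper is hypothetical, and whether Slideflow requires all four keys for every source is an assumption made here for illustration.

.. code-block:: python

    import json
    from pathlib import Path

    # Keys that appear in the configuration format shown above.
    EXPECTED_KEYS = ("slides", "roi", "tiles", "tfrecords")


    def missing_source_keys(config_path):
        """Map each source name to the expected keys it does not define.

        Hypothetical helper for illustration; not part of the Slideflow API.
        """
        with Path(config_path).open() as f:
            config = json.load(f)
        return {
            name: [key for key in EXPECTED_KEYS if key not in paths]
            for name, paths in config.items()
        }

A source that defines all four keys maps to an empty list. Note that ``json.load`` raises an error on malformed input, such as a trailing comma after the last entry.
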
A project can have multiple dataset sources; for example, you may choose to organize data from multiple institutions into separate sources. You can add a new dataset source to a project with :meth:`Project.add_source`, which will update the project dataset configuration file accordingly.

.. code-block:: python

    P.add_source(
      name="SOURCE_NAME",
      slides="/slides/directory",
      roi="/roi/directory",
      tiles="/tiles/directory",
      tfrecords="/tfrecords/directory"
    )

Read more about :ref:`working with datasets <datasets_and_validation>`.

Annotations
***********

Your annotations file is used to label patients and slides with clinical data and/or other outcome variables that will be used for training. Each line in the annotations file should correspond to a unique slide. Patients may have more than one slide.

The annotations file may contain any number of columns, but it must contain the following headers at minimum:

- **patient**: patient identifier
- **slide**: slide name / identifier (without the file extension)

An example annotations file is given below:

+-----------------------+---------------+-----------+-----------------------------------+
| *patient*             | *category*    | *dataset* | *slide*                           |
+-----------------------+---------------+-----------+-----------------------------------+
| TCGA-EL-A23A          | EGFR-mutant   | train     | TCGA-EL-A3CO-01Z-00-DX1-7BF5F     |
+-----------------------+---------------+-----------+-----------------------------------+
| TCGA-EL-A35B          | EGFR-mutant   | eval      | TCGA-EL-A35B-01Z-00-DX1-89FCD     |
+-----------------------+---------------+-----------+-----------------------------------+
| TCGA-EL-A26X          | non-mutant    | train     | TCGA-EL-A26X-01Z-00-DX1-4HA2C     |
+-----------------------+---------------+-----------+-----------------------------------+
| TCGA-EL-B83L          | non-mutant    | eval      | TCGA-EL-B83L-01Z-00-DX1-6BC5L     |
+-----------------------+---------------+-----------+-----------------------------------+

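Before using an annotations file, it can help to confirm that it has the required headers and that no slide is listed twice. The sketch below does this with Python's ``csv`` module. The ``check_annotations`` helper is hypothetical and not part of the Slideflow API. It skips rows with a blank **slide** entry, since blank entries are permitted.

.. code-block:: python

    import csv

    # Headers the annotations file must contain at minimum.
    REQUIRED_HEADERS = {"patient", "slide"}


    def check_annotations(path):
        """Return (missing_headers, duplicate_slides) for an annotations CSV.

        Hypothetical helper for illustration; not part of the Slideflow API.
        """
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            headers = set(reader.fieldnames or [])
            missing = sorted(REQUIRED_HEADERS - headers)
            seen, duplicates = set(), []
            for row in reader:
                slide = (row.get("slide") or "").strip()
                if not slide:
                    continue  # blank slide entries are skipped, not flagged
                if slide in seen:
                    duplicates.append(slide)
                seen.add(slide)
        return missing, duplicates

A well-formed file yields two empty lists. Any other result names the headers to add or the slide names that appear on more than one line.
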
An example annotations file is generated each time a new project is initialized. To manually generate a blank annotations file that lists all detected slides, use the following ``Project`` method:

.. code-block:: python

    P.create_blank_annotations()

The **slide** column does not always need to be filled in by hand. Once a dataset has been set up, Slideflow will search through the linked slide directories and attempt to match slides to entries in the annotations file using **patient**. Entries that are blank in the **slide** column will be auto-populated with any detected and matching slides, if available.
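
To illustrate the idea of filling blank **slide** entries from **patient**, the sketch below uses a simple prefix rule: a slide is assigned to a patient when its name begins with the patient identifier. This rule is an assumption made for illustration only. The text above does not specify how Slideflow decides that a slide matches a patient, and ``fill_blank_slides`` is a hypothetical helper, not part of the Slideflow API.

.. code-block:: python

    def fill_blank_slides(rows, slide_names):
        """Fill blank ``slide`` entries using a simple prefix rule.

        ASSUMPTION: a slide matches a patient when the slide name starts
        with the patient identifier. Slideflow's own logic may differ.
        """
        filled = []
        for row in rows:
            row = dict(row)  # copy so the caller's data is not modified
            if not row.get("slide"):
                matches = [s for s in slide_names if s.startswith(row["patient"])]
                # Only fill when the match is unambiguous.
                if len(matches) == 1:
                    row["slide"] = matches[0]
            filled.append(row)
        return filled

Rows that already name a slide are left unchanged, and rows with zero or several candidate slides stay blank so they can be resolved by hand.
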