<h1>
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="docs/images/nf-core-deepmodeloptim_logo_dark.png">
    <img alt="nf-core/deepmodeloptim" src="docs/images/nf-core-deepmodeloptim_logo_light.png">
  </picture>
</h1>

[![GitHub Actions CI Status](https://github.com/nf-core/deepmodeloptim/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/deepmodeloptim/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/nf-core/deepmodeloptim/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/deepmodeloptim/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/deepmodeloptim/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.2-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/deepmodeloptim)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23deepmodeloptim-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/deepmodeloptim)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

## 📌 **Quick intro**: check out this 👉🏻 [video](https://www.youtube.com/watch?v=dC5p_tXQpEs&list=PLPZ8WHdZGxmVKQga4KE15YVt95i-QXVvE&index=25)!

## Introduction

**nf-core/deepmodeloptim** is an end-to-end bioinformatics pipeline designed to facilitate the testing and development of deep learning models for genomics.

Deep learning model development in the natural sciences is an empirical and costly process. Despite the existence of generic tools for hyperparameter tuning and model training, the connection between these procedures and the impact of the data itself is often overlooked, or at least not easily automated. In practice, researchers must define a pre-processing pipeline, choose an architecture, find the best parameters for that architecture, and iterate over this process, often manually.

Leveraging the power of Nextflow (polyglotism, container integration, cloud scalability), this pipeline helps users to 1) automate model testing, 2) gain useful insights into the learning behaviour of the model, and hence 3) accelerate development.

## Pipeline summary

The pipeline takes as input:

- A dataset
- A configuration file describing the data pre-processing steps to be performed
- A user-defined PyTorch model
- A configuration file describing the range of parameters for the PyTorch model

It then transforms the data according to all possible pre-processing steps, finds the best architecture parameters for each transformed dataset, performs sanity checks on the models, and trains a minimal deep learning version for each dataset/architecture combination.

These experiments are then compiled into an intuitive report, making it easier for scientists to pick the best design choices to carry forward to large-scale training.

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="assets/metromap.png">
  <img alt="nf-core/deepmodeloptim metro map" src="assets/metromap_light.png">
</picture>

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
     Explain what rows and columns represent. For instance (please edit as appropriate):

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).

-->

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/deepmodeloptim \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).

For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/deepmodeloptim/usage) and the [parameter documentation](https://nf-co.re/deepmodeloptim/parameters).

## Pipeline output

To see the results of an example test run with a full-size dataset, refer to the [results](https://nf-co.re/deepmodeloptim/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/deepmodeloptim/output).

<!-- TODO
 Reconcile the previous README with an nf-core format one.
-->

## Code requirements

### Data

The data is provided as a csv file whose header columns follow the format `name:type:class`.

_name_ is user-defined (note that it has an impact on the experiment definition).

_type_ is either "input", "meta", or "label". "input" columns are fed into the model, "meta" columns are registered but neither transformed nor fed into the model, and "label" columns are used as training labels.

_class_ is a supported class of data for which encoding methods have been implemented; please raise an issue on GitHub or contribute a PR if a class of interest to you is not implemented.

#### csv general example

| input1:input:input_type | input2:input:input_type | meta1:meta:meta_type | label1:label:label_type | label2:label:label_type |
| ----------------------- | ----------------------- | -------------------- | ----------------------- | ----------------------- |
| sample1 input1          | sample1 input2          | sample1 meta1        | sample1 label1          | sample1 label2          |
| sample2 input1          | sample2 input2          | sample2 meta1        | sample2 label1          | sample2 label2          |
| sample3 input1          | sample3 input2          | sample3 meta1        | sample3 label1          | sample3 label2          |

#### csv specific example

| mouse_dna:input:dna | mouse_rnaseq:label:float |
| ------------------- | ------------------------ |
| ACTAGGCATGCTAGTCG   | 0.53                     |
| ACTGGGGCTAGTCGAA    | 0.23                     |
| GATGTTCTGATGCT      | 0.98                     |

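As an illustration, the `name:type:class` header convention shown above can be parsed with a few lines of standard-library Python. This is a hedged sketch of the convention itself, not Stimulus's internal implementation; the example data mirrors the specific example table.

```python
import csv
import io

# Example csv content matching the specific example above.
raw = "mouse_dna:input:dna,mouse_rnaseq:label:float\nACTAGGCATGCTAGTCG,0.53\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)

# Split each header column into its name, type, and class parts.
columns = {}
for col in header:
    name, col_type, col_class = col.split(":")  # name:type:class
    columns[name] = {"type": col_type, "class": col_class}

print(columns["mouse_dna"])     # {'type': 'input', 'class': 'dna'}
print(columns["mouse_rnaseq"])  # {'type': 'label', 'class': 'float'}
```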
### Model

In STIMULUS, users provide a `.py` file containing a model written in PyTorch (see examples in `bin/tests/models`).

These models should conform to a few standards:

1. The class of the model you want to train should have a name starting with "Model", and there should be exactly one class whose name starts with "Model".

```python
import torch
import torch.nn as nn

class SubClass(nn.Module):
    """
    A subclass; this will be invisible to Stimulus.
    """

class ModelClass(nn.Module):
    """
    The PyTorch model to be trained by Stimulus; it can use SubClass if needed.
    """

class ModelAnotherClass(nn.Module):
    """
    Uh oh, this will raise an error, as there are two classes whose names start with "Model".
    """
```

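To make the rule concrete, here is a small, hypothetical sketch of how a unique `Model*` class could be discovered in a user module. Plain classes stand in for `nn.Module` subclasses, and `find_model_class` is an illustrative helper, not Stimulus's actual loader.

```python
import inspect

class SubClass:
    """Invisible to the discovery step: the name does not start with "Model"."""

class ModelClass(SubClass):
    """The one class that will be picked up."""

def find_model_class(namespace: dict) -> type:
    # Keep only classes whose name starts with "Model".
    candidates = [
        obj for name, obj in namespace.items()
        if inspect.isclass(obj) and name.startswith("Model")
    ]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one Model* class, found {len(candidates)}")
    return candidates[0]

namespace = {"SubClass": SubClass, "ModelClass": ModelClass}
print(find_model_class(namespace).__name__)  # ModelClass
```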
2. The model's "forward" function should have input parameters with the **same names** as the input names defined in the csv input file.

```python
import torch
import torch.nn as nn

class ModelClass(nn.Module):
    """
    The PyTorch model to be trained by Stimulus.
    """
    def __init__(self):
        super().__init__()
        # your model definition here
        pass

    def forward(self, mouse_dna):
        # `model_layers` is a placeholder for the layers defined in __init__
        output = self.model_layers(mouse_dna)
        return output
```

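This naming contract can be checked with `inspect.signature`. The sketch below is an illustration under assumptions: a plain class stands in for a real `nn.Module`, and the input name is taken from the csv example above.

```python
import inspect

# Torch-free stand-in for a user model; a real one would subclass nn.Module.
class ModelClass:
    def forward(self, mouse_dna):
        return mouse_dna

# Input column names, as parsed from a header such as "mouse_dna:input:dna".
csv_input_names = ["mouse_dna"]

# The forward parameters (excluding self) should match the csv input names.
forward_params = [
    name for name in inspect.signature(ModelClass.forward).parameters
    if name != "self"
]
print(forward_params == csv_input_names)  # True
```

Matching names allow the model to be called as `model.forward(**x)`, where `x` is a dictionary keyed by the csv input names.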
3. The model should include a function named **batch** that takes as input a dictionary of inputs "x", a dictionary of labels "y", a callable loss function, and a callable optimizer.

To allow **batch** to take a callable loss as input, we define an extra compute_loss function that passes the correct output to the correct loss function.

```python
import torch
import torch.nn as nn
from typing import Callable, Optional, Tuple

class ModelClass(nn.Module):
    """
    The PyTorch model to be trained by Stimulus.
    """

    def __init__(self):
        super().__init__()
        # your model definition here
        pass

    def forward(self, mouse_dna):
        output = self.model_layers(mouse_dna)
        return output

    def compute_loss_mouse_rnaseq(self, output: torch.Tensor, mouse_rnaseq: torch.Tensor, loss_fn: Callable) -> torch.Tensor:
        """
        Compute the loss.
        `output` is the output tensor of the forward pass.
        `mouse_rnaseq` is the target tensor -> label column name.
        `loss_fn` is the loss function to be used.

        IMPORTANT: the input variable "mouse_rnaseq" has the same name as the label defined in the csv above.
        """
        return loss_fn(output, mouse_rnaseq)

    def batch(self, x: dict, y: dict, loss_fn: Callable, optimizer: Optional[Callable] = None) -> Tuple[torch.Tensor, dict]:
        """
        Perform one batch step.
        `x` is a dictionary with the input tensors.
        `y` is a dictionary with the target tensors.
        `loss_fn` is the loss function to be used.

        If `optimizer` is passed, it will perform the optimization step -> training step.
        Otherwise, only return the forward pass output and loss -> evaluation step.
        """
        output = self.forward(**x)
        loss = self.compute_loss_mouse_rnaseq(output, **y, loss_fn=loss_fn)
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return loss, output
```

If you don't need the loss function to be configurable, the code above can be written in a simplified manner:

```python
import torch
import torch.nn as nn
from typing import Callable, Optional, Tuple

class ModelClass(nn.Module):
    """
    The PyTorch model to be trained by Stimulus.
    """

    def __init__(self):
        super().__init__()
        # your model definition here
        pass

    def forward(self, mouse_dna):
        output = self.model_layers(mouse_dna)
        return output

    def batch(self, x: dict, y: dict, optimizer: Optional[Callable] = None) -> Tuple[torch.Tensor, dict]:
        """
        Perform one batch step.
        `x` is a dictionary with the input tensors.
        `y` is a dictionary with the target tensors.

        If `optimizer` is passed, it will perform the optimization step -> training step.
        Otherwise, only return the forward pass output and loss -> evaluation step.
        """
        output = self.forward(**x)
        # nn.MSELoss must be instantiated before being called
        loss = nn.MSELoss()(output, y["mouse_rnaseq"])
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return loss, output
```

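As a purely hypothetical, torch-free illustration of the `compute_loss_<label>` naming pattern, a generic runner could locate the per-label loss helper by name with `getattr`. Everything below (the `TinyModel` class, the `squared_error` function, the scalar values) is invented for the sketch and is not Stimulus's actual code.

```python
class TinyModel:
    # Helper named after the label column, following the compute_loss_<label> convention.
    def compute_loss_mouse_rnaseq(self, output, mouse_rnaseq, loss_fn):
        return loss_fn(output, mouse_rnaseq)

def squared_error(a, b):
    return (a - b) ** 2

model = TinyModel()
label_name = "mouse_rnaseq"  # label column name taken from the csv header

# Look up the helper by name, then call it with a toy prediction and target.
loss_helper = getattr(model, f"compute_loss_{label_name}")
loss = loss_helper(0.5, 0.2, loss_fn=squared_error)
print(round(loss, 2))  # 0.09
```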
### Model parameter search design

### Experiment design

The file containing all information about how the data should be handled before tuning is called the `experiment_config`. For now, this file is in `.json` format, but it will soon be moved to `.yaml`, so this section may change in the future.

The `experiment_config` is a mandatory input for the pipeline and can be passed with the flag `--exp_conf` followed by the `PATH` of the file you want to use. Two examples of `experiment_config` can be found in the `examples` directory.

### Experiment config content description

## Credits

<!-- TODO
    Update the author list
-->

nf-core/deepmodeloptim was originally written by Mathys Grapotte ([@mathysgrapotte](https://github.com/mathysgrapotte)).

We would like to thank all the contributors for their extensive assistance in the development of this pipeline, including (but not limited to):

- Alessio Vignoli ([@alessiovignoli](https://github.com/alessiovignoli))
- Suzanne Jin ([@suzannejin](https://github.com/suzannejin))
- Luisa Santus ([@luisas](https://github.com/luisas))
- Jose Espinosa ([@JoseEspinosa](https://github.com/JoseEspinosa))
- Evan Floden ([@evanfloden](https://github.com/evanfloden))
- Igor Trujnara ([@itrujnara](https://github.com/itrujnara))

Special thanks for the artistic work on the logo to Maxime ([@maxulysse](https://github.com/maxulysse)), Suzanne ([@suzannejin](https://github.com/suzannejin)), Mathys ([@mathysgrapotte](https://github.com/mathysgrapotte)) and, not surprisingly, ChatGPT.

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).

For further information or help, don't hesitate to get in touch on the [Slack `#deepmodeloptim` channel](https://nfcore.slack.com/channels/deepmodeloptim) (you can join with [this invite](https://nf-co.re/join/slack)).

## Citations

<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
<!-- If you use nf-core/deepmodeloptim for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

You can cite the `nf-core` publication as follows:

> **The nf-core framework for community-curated bioinformatics pipelines.**
>
> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
>
> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).