# Deep-learning tutorial

In this tutorial, we'll see how we can write our own deep-learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.

If you do not care about the details and just want to train a model, we suggest you use the [training API](/tutorials/training) and move on to the next tutorial.

!!! warning "Hardware requirements"

    Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like [Google Colab](https://colab.research.google.com/), [Kaggle](https://www.kaggle.com/), [Paperspace](https://www.paperspace.com/) or [Vast.ai](https://vast.ai/).

Under the hood, EDS-NLP uses PyTorch to train deep-learning models. EDS-NLP acts as a sidekick to PyTorch, providing a set of tools to perform preprocessing, composition and evaluation. The trainable [`TorchComponents`][edsnlp.core.torch_component.TorchComponent] are actually PyTorch modules with a few extra methods to handle feature preprocessing and postprocessing. Therefore, EDS-NLP is fully compatible with the PyTorch ecosystem.

## Step-by-step walkthrough

Training a supervised deep-learning model consists in feeding batches of annotated samples taken from a training corpus to a model and optimizing the model's parameters to decrease its prediction error. The process of training a pipeline with EDS-NLP is structured as follows:

### 1. Defining the model

We first start by seeding the random states and instantiating a new trainable pipeline composed of [trainable pipes](/pipes/trainable). The model described here computes text embeddings with a pre-trained transformer followed by a CNN, and performs the NER prediction task using a Conditional Random Field (CRF) token classifier.

```python
import edsnlp, edsnlp.pipes as eds
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(  # (1)!
        mode="joint",  # (2)!
        target_span_getter="gold-ner",  # (3)!
        window=20,
        embedding=eds.text_cnn(  # (4)!
            kernel_sizes=[3],
            embedding=eds.transformer(  # (5)!
                model="prajjwal1/bert-tiny",  # (6)!
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)
```

1. We use the `eds.ner_crf` NER task module, which classifies word embeddings into NER labels (BIOUL scheme) using a CRF.
2. Each component of the pipeline can be configured with a dictionary, using the parameters described on the component's documentation page.
3. The `target_span_getter` parameter defines the name of the span group used to train the NER model. In this case, the model will look for the entities to train on in `doc.spans["gold-ner"]`. This is important because we might store entities in other span groups with a different purpose (e.g. `doc.spans["sections"]` contains the section spans, but we don't want to train on these). We will need to make sure the entities from the training dataset are assigned to this span group (next section).
4. The word embeddings used by the CRF are computed by a CNN, which builds on top of another embedding layer.
5. The base embedding layer is a pretrained transformer, which computes contextualized word embeddings.
6. We chose the `prajjwal1/bert-tiny` model in this tutorial for testing purposes, but we recommend using a larger model like `bert-base-cased` or `camembert-base` (French) for real-world applications.

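Since trainable pipes are regular PyTorch modules, we can already inspect the freshly defined component like any other `torch.nn.Module`. Here is an optional sanity check, assuming the pipeline defined in the snippet above:

```{ .python .no-check }
import torch

ner = nlp.pipes.ner  # the trainable pipe we just added
print(isinstance(ner, torch.nn.Module))  # True
print(sum(p.numel() for p in ner.parameters()))  # number of trainable parameters
```
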
### 2. Loading the raw dataset and converting it into Doc objects

To train a pipeline, we must convert our annotated data into `Doc` objects that will be used either as training samples or evaluation samples. We will assume the dataset is in [Standoff format](/data/standoff), usually produced by the [Brat](https://brat.nlplab.org) annotation tool, but any format can be used.

At this step, we might also want to perform data augmentation, filtering, splitting or any other data transformation. In this tutorial, we will split documents on line jumps and filter out empty documents from the training data. We will use our [Stream][edsnlp.core.stream.Stream] API to handle the data processing, but you can use any method you like, so long as you end up with a collection of `Doc` objects.

```{ .python .no-check }
import edsnlp


def skip_empty_docs(batch):
    for doc in batch:
        if len(doc.ents) > 0:
            yield doc


training_data = (
    edsnlp.data.read_standoff(  # (1)!
        train_data_path,
        tokenizer=nlp.tokenizer,  # (2)!
        span_setter=["ents", "gold-ner"],  # (3)!
    )
    .map(eds.split(regex="\n\n"))  # (4)!
    .map_batches(skip_empty_docs)  # (5)!
)
```

1. Read the data from the Brat directory and convert it into `Doc` objects.
2. Tokenize the training docs with the same tokenizer as the one used by the trained model.
3. Store the annotated Brat entities as spans in `doc.ents` and in `doc.spans["gold-ner"]`.
4. Split the documents on line jumps.
5. Filter out empty documents.

As for the validation data, we will keep all the documents, even empty ones, to obtain representative metrics.

```{ .python .no-check }
val_data = edsnlp.data.read_standoff(
    val_data_path,
    tokenizer=nlp.tokenizer,
    span_setter=["ents", "gold-ner"],
)
val_docs = list(val_data)  # (1)!
```

1. Cache the stream result into a list of `Doc` objects.

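To sanity-check the conversion, we can peek at one of the converted documents. This is a purely illustrative snippet; the entities and labels depend on your own Brat annotations:

```{ .python .no-check }
doc = val_docs[0]
print(doc.text[:100])
print([(ent.text, ent.label_) for ent in doc.spans["gold-ner"]])
```
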
### 3. Completing the initialization of the model

We initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset. Indeed, when defining the model, we specified its architecture, but we did not specify the types of named entities that it will predict. This can be done either

- explicitly, by setting the `labels` parameter of `eds.ner_crf` in the [definition](#1-defining-the-model) above,
- automatically, with `post_init`: `eds.ner_crf` then looks in `doc.spans[target_span_getter]` of all docs in `training_data` to infer the labels.

```{ .python .no-check }
nlp.post_init(training_data)
```
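After this step, the labels inferred from the training data are stored on the component. As a quick check, and assuming the NER component exposes its label vocabulary as a `labels` attribute (mirroring the parameter of the same name), we could print it:

```{ .python .no-check }
print(nlp.pipes.ner.labels)
```
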
### 4. Making the stream of mini-batches

The training dataset of `Doc` objects is then preprocessed into features to be fed to the model during the training loop. We will continue to use EDS-NLP's streams to handle the data processing:

- We first request the training data stream to loop on the input data, since we want each example to be seen multiple times during training until a given number of steps is reached

    ??? note "Looping in EDS-NLP Streams"

        Note that in EDS-NLP, looping on a stream is always done on the input data, no matter when `loop()` is called. This means that shuffling or any further preprocessing step will be applied multiple times, each time we loop. This is usually a good thing if preprocessing contains randomness, as it increases the diversity of the training samples while avoiding loading multiple versions of a same document in memory. To loop after preprocessing, we can collect the stream into a list and loop on the list (`edsnlp.data.from_iterable(list(training_data)).loop()`).

- We shuffle the data before batching to diversify the samples in each mini-batch
- We extract the features and labels required by each component (and sub-components) of the pipeline
- Finally, we group the samples into mini-batches, such that each mini-batch contains at most a given number of tokens (or satisfies any other batching criterion), and assemble (or "collate") the features into tensors

```{ .python .no-check }
import torch

from edsnlp.utils.batching import stat_batchify

device = "cuda" if torch.cuda.is_available() else "cpu"  # (1)!
batches = (
    training_data.loop()
    .shuffle("dataset")  # (2)!
    .map(nlp.preprocess, kwargs={"supervision": True})  # (3)!
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))  # (4)!
    .map(nlp.collate, kwargs={"device": device})
)
```

1. Check if a GPU is available and set the device accordingly.
2. Apply shuffling to our stream. If our dataset is too large to fit in memory, instead of "dataset" we can set the shuffle batch size to "100 docs" for example, or to "fragment" for parquet datasets.
3. This will call the `preprocess_supervised` method of the [TorchComponent][edsnlp.core.torch_component.TorchComponent] class and return a nested dictionary containing the required features and labels.
4. Make batches that contain at most 32 * 128 tokens (e.g. 32 samples of 128 tokens each, while accounting for the fact that samples may have different lengths). We use the `stat_batchify` function to look for a key containing `tokens` in the features' `stats` sub-dictionary and add samples to the batch until the sum of the `*tokens*` stats exceeds 32 * 128 (see the sketch after this list).

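To make the batching criterion more concrete, here is a minimal, hypothetical re-implementation of the "maximum total tokens" logic. The real `stat_batchify` operates on the nested feature dictionaries produced by `nlp.preprocess`; this is only a sketch of the idea:

```python
def batch_by_total_tokens(samples, max_tokens=32 * 128):
    """Group samples until adding one more would exceed the token budget."""
    batch, total = [], 0
    for sample in samples:
        n_tokens = sample["stats"]["tokens"]
        if batch and total + n_tokens > max_tokens:
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += n_tokens
    if batch:
        yield batch


# Three samples of 2000, 2000 and 300 tokens with a budget of 4096 tokens
# -> two batches: [2000, 2000] and [300]
samples = [{"stats": {"tokens": n}} for n in (2000, 2000, 300)]
print([[s["stats"]["tokens"] for s in b] for b in batch_by_total_tokens(samples, 4096)])
```
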
And that's it! We now have a looping stream of mini-batches that we can feed to our model.
For better efficiency, we can also perform this processing in parallel in a separate worker by setting `num_cpu_workers` to 1 or more.
Note that streams in EDS-NLP are lazy, meaning that the execution has not started yet, and the data is not loaded in memory. This will only happen when we start iterating over the stream in the next section.

```{ .python .no-check }
batches = batches.set_processing(
    num_cpu_workers=1,
    process_start_method="spawn",  # (1)!
)
```

1. Since we use a GPU, we must use the "spawn" method to create the workers. This is because the default multiprocessing "fork" method is not compatible with CUDA.

### 5. The training loop

We instantiate a PyTorch optimizer and start the training loop:

```{ .python .no-check }
from tqdm import tqdm
import torch

lr = 3e-4
max_steps = 400

# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

iterator = iter(batches)

for step in tqdm(range(max_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()
```

### 6. Optimizing the weights

Inside the training loop, the trainable components are fed the collated batches from the dataloader by calling the [`TorchComponent.forward`][edsnlp.core.torch_component.TorchComponent.forward] method (via a simple call) to compute the losses. When training a multitask model (not the case in this tutorial) in which the outputs of a shared embedding are reused between components, we enable caching by wrapping this step in a cache context, so that shared computations are not repeated. The rest of the training loop is carried out like a standard PyTorch training loop.

```{ .python .no-check }
    with nlp.cache():
        loss = torch.zeros((), device=device)
        for name, component in nlp.torch_components():
            output = component(batch[name])
            if "loss" in output:
                loss += output["loss"]

    loss.backward()

    optimizer.step()
```

### 7. Evaluating the model

Finally, the model is evaluated on the validation dataset and saved at regular intervals. We will use the `NerExactMetric` to evaluate the NER performance using Precision, Recall and F1 scores. This metric only counts an entity as correct if it matches the label and boundaries of a target entity.

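As a mental model, exact-match NER metrics compare sets of `(start, end, label)` triples between the gold and predicted documents. The sketch below is not the actual `NerExactMetric` implementation, only an illustration of the counting logic:

```python
def exact_ner_scores(gold, pred):
    """gold and pred are sets of (start, end, label) triples."""
    tp = len(gold & pred)  # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}


gold = {(0, 11, "DRUG"), (15, 20, "DOSE")}
pred = {(0, 11, "DRUG"), (15, 21, "DOSE")}  # wrong end boundary -> not counted
print(exact_ner_scores(gold, pred))  # {'p': 0.5, 'r': 0.5, 'f': 0.5}
```

With this in mind, here is the evaluation step of the training loop:
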
```{ .python .no-check }
from edsnlp.metrics.ner import NerExactMetric
from copy import deepcopy

metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

    ...
    if ((step + 1) % 100) == 0:
        with nlp.select_pipes(enable=["ner"]):  # (1)!
            preds = deepcopy(val_docs)
            for doc in preds:
                doc.ents = doc.spans["gold-ner"] = []  # (2)!
            preds = nlp.pipe(preds)  # (3)!
            print(metric(val_docs, preds))

        nlp.to_disk("model")  # (4)!
```

1. In case we have multiple pipes in our model, we may want to evaluate each pipe selectively, hence we use the `select_pipes` method to disable every pipe except "ner".
2. Clean the documents that our model will annotate.
3. We use the `pipe` method to run the "ner" component on the validation dataset. This method is similar to the `__call__` method of EDS-NLP components, but it is used to run a component on a list of Docs. This is also equivalent to
    ```{ .python .no-check }
    preds = (
        edsnlp.data
        .from_iterable(preds)
        .map_pipeline(nlp)
    )
    ```
4. We could also have saved the model with `torch.save(model, "model.pt")`, but `nlp.to_disk` avoids pickling and lets us inspect the model's files by saving them into a structured directory.

## Full example

Let's wrap the training code in a function, and make it callable from the command line using [confit](https://github.com/aphp/confit)!

??? example "train.py"

    ```python linenums="1"
    from copy import deepcopy
    from typing import Iterator

    import torch
    from confit import Cli
    from tqdm import tqdm

    import edsnlp
    import edsnlp.pipes as eds
    from edsnlp.metrics.ner import NerExactMetric
    from edsnlp.utils.batching import stat_batchify

    app = Cli(pretty_exceptions_show_locals=False)


    @app.command(name="train", registry=edsnlp.registry)  # (1)!
    def train_model(
        nlp: edsnlp.Pipeline,
        train_data_path: str,
        val_data_path: str,
        batch_size: int = 32 * 128,
        lr: float = 1e-4,
        max_steps: int = 400,
        num_preprocessing_workers: int = 1,
        evaluation_interval: int = 100,
    ):
        device = "cuda" if torch.cuda.is_available() else "cpu"

        # Define function to skip empty docs
        def skip_empty_docs(batch: Iterator) -> Iterator:
            for doc in batch:
                if len(doc.ents) > 0:
                    yield doc

        # Load and process training data
        training_data = (
            edsnlp.data.read_standoff(
                train_data_path,
                span_setter=["ents", "gold-ner"],
                tokenizer=nlp.tokenizer,
            )
            .map(eds.split(regex="\n\n"))
            .map_batches(skip_empty_docs)
        )

        # Load validation data
        val_data = edsnlp.data.read_standoff(
            val_data_path,
            span_setter=["ents", "gold-ner"],
            tokenizer=nlp.tokenizer,
        )
        val_docs = list(val_data)

        # Initialize components
        nlp.post_init(training_data)

        # Prepare the stream of batches
        batches = (
            training_data.loop()
            .shuffle("dataset")
            .map(nlp.preprocess, kwargs={"supervision": True})
            .batchify(batch_size=batch_size, batch_by=stat_batchify("tokens"))
            .map(nlp.collate, kwargs={"device": device})
            .set_processing(
                num_cpu_workers=num_preprocessing_workers,
                process_start_method="spawn",
            )
        )

        # Move the model to the GPU if available
        nlp.to(device)

        # Initialize optimizer
        optimizer = torch.optim.AdamW(params=nlp.parameters(), lr=lr)

        metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

        # Training loop
        iterator = iter(batches)
        for step in tqdm(range(max_steps), "Training model", leave=True):
            batch = next(iterator)
            optimizer.zero_grad()

            with nlp.cache():
                loss = torch.zeros((), device=device)
                for name, component in nlp.torch_components():
                    output = component(batch[name])
                    if "loss" in output:
                        loss += output["loss"]

            loss.backward()
            optimizer.step()

            # Evaluation and model saving
            if ((step + 1) % evaluation_interval) == 0:
                with nlp.select_pipes(enable=["ner"]):
                    # Clean the documents that our model will annotate
                    preds = deepcopy(val_docs)
                    for doc in preds:
                        doc.ents = doc.spans["gold-ner"] = []
                    preds = nlp.pipe(preds)
                    print(metric(val_docs, preds))

                nlp.to_disk("model")


    if __name__ == "__main__":
        nlp = edsnlp.blank("eds")
        nlp.add_pipe(
            eds.ner_crf(
                mode="joint",
                target_span_getter="gold-ner",
                window=20,
                embedding=eds.text_cnn(
                    kernel_sizes=[3],
                    embedding=eds.transformer(
                        model="prajjwal1/bert-tiny",
                        window=128,
                        stride=96,
                    ),
                ),
            ),
            name="ner",
        )
        train_model(
            nlp,
            train_data_path="my_brat_data/train",
            val_data_path="my_brat_data/val",
            batch_size=32 * 128,
            lr=1e-4,
            max_steps=1000,
            num_preprocessing_workers=1,
            evaluation_interval=100,
        )
    ```

    1. This will become useful in the next section, when we use the configuration file to define the pipeline. If you don't want to use a configuration file, you can remove this decorator.

We can now copy the above code in a notebook and run it, or call this script from the command line:

```{: data-md-color-scheme="slate" }
python train.py
```

At the end of the training, the pipeline is ready to use, since every trained component of the pipeline is self-sufficient, i.e. it contains the preprocessing, inference and postprocessing code required to run it.

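For instance, assuming the pipeline was saved to the `model` directory as above, reloading and applying it could look like the following sketch:

```{ .python .no-check }
import edsnlp

nlp = edsnlp.load("model")
doc = nlp("Le patient est traité par paracétamol pour une pneumopathie.")
print([(ent.text, ent.label_) for ent in doc.ents])  # predicted entities
```
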
## Configuration

To decouple the configuration and the code of our training script, let's define a configuration file where we will describe **both** our training parameters and the pipeline. You can either write the config of the pipeline by hand, or generate a pipeline config draft from an instantiated pipeline by running:

```{ .python .no-check }
print(nlp.config.to_yaml_str())
```

```yaml title="config.yml"
nlp:
  "@core": "pipeline"
  lang: "eds"
  components:
    ner:
      "@factory": "eds.ner_crf"
      mode: "joint"
      target_span_getter: "gold-ner"
      window: 20

      embedding:
        "@factory": "eds.text_cnn"
        kernel_sizes: [3]

        embedding:
          "@factory": "eds.transformer"
          model: "prajjwal1/bert-tiny"
          window: 128
          stride: 96

train:
  nlp: ${ nlp }
  train_data_path: my_brat_data/train
  val_data_path: my_brat_data/val
  batch_size: ${ 32 * 128 }
  lr: 1e-4
  max_steps: 400
  num_preprocessing_workers: 1
  evaluation_interval: 100
```

And replace the end of the script with:

```{ .python .no-check }
if __name__ == "__main__":
    app.run()
```

That's it! We can now call the training script with the configuration file as a parameter, and override some of its values:

```{: .shell data-md-color-scheme="slate" }
python train.py --config config.yml --nlp.components.ner.embedding.embedding.window=64 --seed 43
```

## Going further

EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about it in the [next Training API tutorial](/tutorials/training).

This tutorial gave you a glimpse of the training API of EDS-NLP. To build a custom trainable component, you can refer to the [TorchComponent][edsnlp.core.torch_component.TorchComponent] class or look up the implementation of [some of the trainable components on GitHub](https://github.com/aphp/edsnlp/tree/master/edsnlp/pipes/trainable).

We also recommend looking at an existing project as a reference, such as [eds-pseudo](https://github.com/aphp/eds-pseudo) or [mlg-norm](https://github.com/percevalw/mlg-norm).