In this tutorial, we'll see how we can write our own deep learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.
If you do not care about the details and just want to train a model, we suggest you use the training API and move on to the next tutorial.
!!! warning "Hardware requirements"

    Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like [Google Colab](https://colab.research.google.com/), [Kaggle](https://www.kaggle.com/), [Paperspace](https://www.paperspace.com/) or [Vast.ai](https://vast.ai/).
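
    To quickly check whether PyTorch can see a GPU, you can run the following snippet (a minimal sketch, assuming PyTorch is installed):

    ```{ .python .no-check }
    import torch

    print(torch.cuda.is_available())  # True if PyTorch can use a GPU
    ```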
Under the hood, EDS-NLP uses PyTorch to train deep-learning models. EDS-NLP acts as a sidekick to PyTorch, providing a set of tools to perform preprocessing, composition and evaluation. The trainable [TorchComponents][edsnlp.core.torch_component.TorchComponent] are actually PyTorch modules with a few extra methods to handle feature preprocessing and postprocessing. Therefore, EDS-NLP is fully compatible with the PyTorch ecosystem.
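Since trainable pipes are regular PyTorch modules, the usual PyTorch tooling applies to them directly. A minimal sketch, assuming the `nlp` pipeline that we define just below:

```{ .python .no-check }
import torch

ner = nlp.pipes.ner  # the trainable NER component defined below
print(isinstance(ner, torch.nn.Module))  # True: it is a regular PyTorch module
print(sum(p.numel() for p in ner.parameters()))  # number of trainable parameters
```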
Training a supervised deep-learning model consists in feeding batches of annotated samples taken from a training corpus to a model and optimizing the model's parameters to decrease its prediction
error. The process of training a pipeline with EDS-NLP is structured as follows:
We start by seeding the random states and instantiating a new trainable pipeline composed of trainable pipes. The model described here computes text embeddings with a pre-trained transformer followed by a CNN, and performs
the NER prediction task using a Conditional Random Field (CRF) token classifier.
```python
import edsnlp, edsnlp.pipes as eds
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(  # (1)!
        mode="joint",  # (2)!
        target_span_getter="gold-ner",  # (3)!
        window=20,
        embedding=eds.text_cnn(  # (4)!
            kernel_sizes=[3],
            embedding=eds.transformer(  # (5)!
                model="prajjwal1/bert-tiny",  # (6)!
                window=128,
                stride=96,
            ),
        ),
    ),
    name="ner",
)
```
1. The `eds.ner_crf` NER task module classifies word embeddings into NER labels (BIOUL scheme) using a CRF.
2. The `mode` parameter controls how the CRF trains on and decodes the tag sequences; we use the `"joint"` mode here.
3. The `target_span_getter` parameter defines the name of the span group used to train the NER model. In this case, the model will look for the entities to train on in `doc.spans["gold-ner"]`. This is important because we might store entities in other span groups with a different purpose (e.g. `doc.spans["sections"]` contains the section spans, but we don't want to train on these). We will need to make sure the entities from the training dataset are assigned to this span group (next section).
4. The word embeddings fed to the CRF are computed by a text CNN applied on top of the transformer output.
5. The transformer component computes contextual word embeddings with a pre-trained model.
6. We use the `prajjwal1/bert-tiny` model in this tutorial for testing purposes, but we recommend using a larger model like `bert-base-cased` or `camembert-base` (French) for real-world applications.

To train a pipeline, we must convert our annotated data into `Doc` objects that will be used either as training samples or as evaluation samples. We will assume the dataset is in Standoff format, usually produced by the Brat annotation tool, but any format can be used.
At this step, we might also want to perform data augmentation, filtering, splitting or any other data transformation. In this tutorial, we will split documents on line jumps and filter out empty documents from the training data. We will use our [Stream][edsnlp.core.stream.Stream] API to handle the data processing, but you can use any method you like, as long as you end up with a collection of `Doc` objects.
```{ .python .no-check }
import edsnlp
import edsnlp.pipes as eds


def skip_empty_docs(batch):
    for doc in batch:
        if len(doc.ents) > 0:
            yield doc


training_data = (
    edsnlp.data.read_standoff(  # (1)!
        train_data_path,
        tokenizer=nlp.tokenizer,  # (2)!
        span_setter=["ents", "gold-ner"],  # (3)!
    )
    .map(eds.split(regex="\n\n"))  # (4)!
    .map_batches(skip_empty_docs)  # (5)!
)
```
1. Read the documents from the standoff (Brat) files in the `train_data_path` directory.
2. Tokenize the documents with the tokenizer of our pipeline.
3. Store the annotated entities both in `doc.ents` and in `doc.spans["gold-ner"]`, the span group our NER component will be trained on.
4. Split the documents on line jumps.
5. Filter out empty documents.
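Before moving on, we can peek at a converted document to check that the gold entities ended up in the expected span groups (a quick sanity-check sketch):

```{ .python .no-check }
# The span_setter above writes the gold entities both to doc.ents
# and to doc.spans["gold-ner"], the group our NER component trains on.
doc = next(iter(training_data))
print(doc.ents)
print(doc.spans["gold-ner"])
```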
As for the validation data, we will keep all the documents, even empty ones, to obtain representative metrics.
```{ .python .no-check }
val_data = edsnlp.data.read_standoff(
    val_data_path,
    tokenizer=nlp.tokenizer,
    span_setter=["ents", "gold-ner"],
)
val_docs = list(val_data)  # (1)!
```
1. We materialize the stream into a list of `Doc` objects, since we will reuse the same validation documents at every evaluation.
We then initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset. Indeed, when defining the model, we specified its architecture, but we did not specify the types of named entities that the model will predict. This can be done either:

- explicitly, by setting the `labels` parameter of `eds.ner_crf` in the definition above, or
- automatically, with `post_init`: `eds.ner_crf` then looks in `doc.spans[target_span_getter]` of all the docs in `training_data` to infer the labels.

```{ .python .no-check }
nlp.post_init(training_data)
```
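After `post_init`, the NER component knows which entity labels it must predict. As a quick check, assuming the component exposes the inferred labels through a `labels` attribute (as suggested by the `labels` parameter above):

```{ .python .no-check }
# Hypothetical inspection: print the labels inferred from the training data.
print(nlp.pipes.ner.labels)
```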
The training dataset of `Doc` objects is then preprocessed into features to be fed to the model during the training loop. We will continue to use EDS-NLP's streams to handle the data processing:

We first request the training data stream to loop on the input data, since we want each example to be seen multiple times during training, until a given number of steps is reached.
??? note "Looping in EDS-NLP Streams"

    Note that in EDS-NLP, looping on a stream is always done on the input data, no matter when `loop()` is called. This means that shuffling or any further preprocessing step will be applied multiple times, each time we loop. This is usually a good thing if preprocessing contains randomness, as it increases the diversity of the training samples while avoiding loading multiple versions of the same document in memory. To loop after preprocessing instead, we can collect the stream into a list and loop on the list (`edsnlp.data.from_iterable(list(training_data), loop=True)`).
We then shuffle the data before batching, to diversify the samples in each mini-batch.
```{ .python .no-check }
import torch

from edsnlp.utils.batching import stat_batchify

device = "cuda" if torch.cuda.is_available() else "cpu"  # (1)!

batches = (
    training_data.loop()
    .shuffle("dataset")  # (2)!
    .map(nlp.preprocess, kwargs={"supervision": True})  # (3)!
    .batchify(batch_size=32 * 128, batch_by=stat_batchify("tokens"))  # (4)!
    .map(nlp.collate, kwargs={"device": device})
)
```
1. We train on a GPU if one is available, otherwise on the CPU.
2. We shuffle the examples at the scale of the whole dataset before batching.
3. The `supervision=True` flag tells the pipeline to preprocess the labels along with the features: under the hood, this calls the `preprocess_supervised` method of the [TorchComponent][edsnlp.core.torch_component.TorchComponent] class and returns a nested dictionary containing the required features and labels.
4. We use the `stat_batchify` function to look for a key containing "tokens" in the "stats" sub-dictionary of the features, and add samples to the batch until the sum of these *tokens* stats exceeds 32 * 128.

And that's it! We now have a looping stream of mini-batches that we can feed to our model.
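Out of curiosity, we can pull one mini-batch from the stream to see what the model will receive (a quick sketch; the exact structure depends on the pipeline components):

```{ .python .no-check }
# Each trainable pipe (here "ner") has its own entry in the collated batch,
# which is what the training loop below passes to the corresponding module.
batch = next(iter(batches))
print(batch.keys())
```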
For better efficiency, we can also perform this preprocessing in parallel in a separate worker by setting `num_cpu_workers` to 1 or more.
Note that streams in EDS-NLP are lazy, meaning that the execution has not started yet, and the data is not loaded in memory. This will only happen when we start iterating over the stream in the next section.
```{ .python .no-check }
batches = batches.set_processing(
    num_cpu_workers=1,
    process_start_method="spawn",  # (1)!
)
```

1. Since we train on a GPU, we use the "spawn" start method: CUDA cannot be re-initialized in a worker process forked from the main process.
We then instantiate a PyTorch optimizer and start the training loop:
```{ .python .no-check }
from tqdm import tqdm
import torch

lr = 3e-4
max_steps = 400

# Move the model to the GPU
nlp.to(device)

optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

iterator = iter(batches)

for step in tqdm(range(max_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()
```
Inside the training loop, the trainable components are fed the collated batches from the dataloader by calling the [TorchComponent.forward][edsnlp.core.torch_component.TorchComponent.forward] method (via a simple call) to compute the losses. When training a multitask model (not the case in this tutorial), where the outputs of a shared embedding are reused between components, we enable caching by wrapping this step in a cache context. The training loop is otherwise carried out like a standard PyTorch training loop.
```{ .python .no-check }
    with nlp.cache():
        loss = torch.zeros((), device=device)
        for name, component in nlp.torch_components():
            output = component(batch[name])
            if "loss" in output:
                loss += output["loss"]

    loss.backward()

    optimizer.step()
```
Finally, the model is evaluated on the validation dataset and saved at regular intervals. We will use the `NerExactMetric` to evaluate the NER performance with Precision, Recall and F1 scores. This metric only counts an entity as correct if it matches both the label and the boundaries of a target entity.
```{ .python .no-check }
from edsnlp.metrics.ner import NerExactMetric
from copy import deepcopy

metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

...

    if ((step + 1) % 100) == 0:
        with nlp.select_pipes(enable=["ner"]):  # (1)!
            preds = deepcopy(val_docs)
            for doc in preds:
                doc.ents = doc.spans["gold-ner"] = []  # (2)!
            preds = nlp.pipe(preds)  # (3)!
            print(metric(val_docs, preds))
        nlp.to_disk("model")  # (4)!
```
1. We use the `select_pipes` method to temporarily disable every pipe except "ner".
2. We reset the entities and the "gold-ner" span group of the copied documents, since these are precisely the annotations our model should predict.
3. We use the `pipe` method to run the "ner" component on the validation dataset. This method is similar to the `__call__` method of EDS-NLP components, but it runs a component on a list of documents. It is equivalent to:

    ```{ .python .no-check }
    preds = (
        edsnlp.data
        .from_iterable(preds)
        .map_pipeline(nlp)
    )
    ```

4. We could also save the model with `torch.save(model, "model.pt")`, but `nlp.to_disk` avoids pickling and allows us to inspect the model's files by saving them into a structured directory.

Let's wrap the training code in a function, and make it callable from the command line using confit!
??? example "train.py"

    ```python linenums="1"
    from copy import deepcopy
    from typing import Iterator

    import torch
    from confit import Cli
    from tqdm import tqdm

    import edsnlp
    import edsnlp.pipes as eds
    from edsnlp.metrics.ner import NerExactMetric
    from edsnlp.utils.batching import stat_batchify

    app = Cli(pretty_exceptions_show_locals=False)


    @app.command(name="train", registry=edsnlp.registry)  # (1)!
    def train_model(
        nlp: edsnlp.Pipeline,
        train_data_path: str,
        val_data_path: str,
        batch_size: int = 32 * 128,
        lr: float = 1e-4,
        max_steps: int = 400,
        num_preprocessing_workers: int = 1,
        evaluation_interval: int = 100,
    ):
        device = "cuda" if torch.cuda.is_available() else "cpu"

        # Define function to skip empty docs
        def skip_empty_docs(batch: Iterator) -> Iterator:
            for doc in batch:
                if len(doc.ents) > 0:
                    yield doc

        # Load and process training data
        training_data = (
            edsnlp.data.read_standoff(
                train_data_path,
                span_setter=["ents", "gold-ner"],
                tokenizer=nlp.tokenizer,
            )
            .map(eds.split(regex="\n\n"))
            .map_batches(skip_empty_docs)
        )

        # Load validation data
        val_data = edsnlp.data.read_standoff(
            val_data_path,
            span_setter=["ents", "gold-ner"],
            tokenizer=nlp.tokenizer,
        )
        val_docs = list(val_data)

        # Initialize components
        nlp.post_init(training_data)

        # Prepare the stream of batches
        batches = (
            training_data.loop()
            .shuffle("dataset")
            .map(nlp.preprocess, kwargs={"supervision": True})
            .batchify(batch_size=batch_size, batch_by=stat_batchify("tokens"))
            .map(nlp.collate, kwargs={"device": device})
            .set_processing(
                num_cpu_workers=num_preprocessing_workers,
                process_start_method="spawn",
            )
        )

        # Move the model to the GPU if available
        nlp.to(device)

        # Initialize optimizer
        optimizer = torch.optim.AdamW(params=nlp.parameters(), lr=lr)

        metric = NerExactMetric(span_getter=nlp.pipes.ner.target_span_getter)

        # Training loop
        iterator = iter(batches)
        for step in tqdm(range(max_steps), "Training model", leave=True):
            batch = next(iterator)
            optimizer.zero_grad()

            with nlp.cache():
                loss = torch.zeros((), device=device)
                for name, component in nlp.torch_components():
                    output = component(batch[name])
                    if "loss" in output:
                        loss += output["loss"]

            loss.backward()

            optimizer.step()

            # Evaluation and model saving
            if ((step + 1) % evaluation_interval) == 0:
                with nlp.select_pipes(enable=["ner"]):
                    # Clean the documents that our model will annotate
                    preds = deepcopy(val_docs)
                    for doc in preds:
                        doc.ents = doc.spans["gold-ner"] = []
                    preds = nlp.pipe(preds)
                    print(metric(val_docs, preds))
                nlp.to_disk("model")


    if __name__ == "__main__":
        nlp = edsnlp.blank("eds")
        nlp.add_pipe(
            eds.ner_crf(
                mode="joint",
                target_span_getter="gold-ner",
                window=20,
                embedding=eds.text_cnn(
                    kernel_sizes=[3],
                    embedding=eds.transformer(
                        model="prajjwal1/bert-tiny",
                        window=128,
                        stride=96,
                    ),
                ),
            ),
            name="ner",
        )
        train_model(
            nlp,
            train_data_path="my_brat_data/train",
            val_data_path="my_brat_data/val",
            batch_size=32 * 128,
            lr=1e-4,
            max_steps=1000,
            num_preprocessing_workers=1,
            evaluation_interval=100,
        )
    ```

    1. This will become useful in the next section, when we use the configuration file to define the pipeline. If you don't want to use a configuration file, you can remove this decorator.
We can now copy the above code in a notebook and run it, or call this script from the command line:
```{ .bash }
python train.py
```
At the end of the training, the pipeline is ready to use since every trained component of the pipeline is self-sufficient, i.e. it contains the preprocessing, inference and postprocessing code required to run it.
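For instance, here is a minimal sketch of how the saved pipeline could be reloaded and applied to a new document (assuming the model was saved to the `model` directory as above, and using an arbitrary example text):

```{ .python .no-check }
import edsnlp

nlp = edsnlp.load("model")  # reload the pipeline saved with nlp.to_disk("model")
doc = nlp("Le patient présente une anémie sévère.")
print(doc.ents)
```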
To decouple the configuration and the code of our training script, let's define a configuration file where we will describe both our training parameters and the pipeline. You can either write the
config of the pipeline by hand, or generate a pipeline config draft from an instantiated pipeline by running:
```{ .python .no-check }
print(nlp.config.to_yaml_str())
```
```yaml title="config.yml"
nlp:
  "@core": "pipeline"
  lang: "eds"

  components:
    ner:
      "@factory": "eds.ner_crf"
      mode: "joint"
      target_span_getter: "gold-ner"
      window: 20

      embedding:
        "@factory": "eds.text_cnn"
        kernel_sizes: [3]

        embedding:
          "@factory": "eds.transformer"
          model: "prajjwal1/bert-tiny"
          window: 128
          stride: 96

train:
  nlp: ${ nlp }
  train_data_path: my_brat_data/train
  val_data_path: my_brat_data/val
  batch_size: ${ 32 * 128 }
  lr: 1e-4
  max_steps: 400
  num_preprocessing_workers: 1
  evaluation_interval: 100
```
And replace the end of the script with:
```{ .python .no-check }
if __name__ == "__main__":
    app.run()
```
That's it! We can now call the training script with the configuration file as a parameter, and override some of its values:
```{ .bash }
python train.py --config config.yml --nlp.components.ner.embedding.embedding.transformer.window=64 --seed 43
```
EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about it in the next Training API tutorial.
This tutorial gave you a glimpse of the training API of EDS-NLP. To build a custom trainable component, you can refer to the [TorchComponent][edsnlp.core.torch_component.TorchComponent] class or look up the implementation of some of the trainable components on GitHub.
We also recommend looking at an existing project as a reference, such as eds-pseudo or mlg-norm.