# Pipeline {: #edsnlp.core.pipeline.Pipeline }

The goal of EDS-NLP is to provide a **framework** for processing textual documents.

Processing textual documents, and clinical documents in particular, usually involves many steps such as tokenization, cleaning, named entity recognition, span classification, normalization, linking, etc. Organising these steps together, combining static and deep learning components while remaining modular and efficient, is a challenge. This is why EDS-NLP is built on top of a **novel pipelining system**.

!!! note "Deep learning frameworks"

    Trainable components in EDS-NLP are built around the PyTorch framework. While you
    can use any technology in static components, we do not provide tools to train
    components built with other deep learning frameworks.

## Compatibility with spaCy and PyTorch

While EDS-NLP is built on top of its own pipeline system, it is also designed to be compatible with the awesome [spaCy](https://spacy.io) framework. This means that you can use (non-trainable) EDS-NLP components in a spaCy pipeline, and vice versa. The documents passed through the pipeline are in fact spaCy `Doc` objects, and we borrow many of spaCy's method names and conventions to make the transition between the two libraries as smooth as possible.
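
For instance, here is a minimal sketch of this interoperability, mirroring the spaCy-like API shown below (we assume that importing `edsnlp` registers the `eds` language and the EDS-NLP factories with spaCy):

```{ .python .no-check }
import edsnlp  # assumed to register the `eds` language and EDS-NLP factories
import spacy

nlp = spacy.blank("eds")
nlp.add_pipe("eds.sentences")  # a non-trainable EDS-NLP pipe in a spaCy pipeline
doc = nlp("Le patient ne fume pas")
```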

Trainable components, on the other hand, are built on top of the [PyTorch](https://pytorch.org) framework. This means that you can use PyTorch components in an EDS-NLP pipeline and benefit from the latest advances in deep learning research. For more information on PyTorch components, refer to the [Torch component](../torch-component) page.

## Creating a pipeline

A pipeline is composed of multiple pipes: callable processing blocks, such as functions, that apply a transformation to a Doc object (for instance, adding annotations) and return the modified object.

We provide several ways to create your first EDS-NLP pipeline:

=== "EDS-NLP API"

    This is the recommended way to create a pipeline, as it allows auto-completion, type checking and introspection (in most IDEs, you can click on a component or its arguments to see its documentation).

    ```python
    import edsnlp, edsnlp.pipes as eds

    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    nlp.add_pipe(eds.matcher(regex={"smoker": ["fume", "clope"]}))
    nlp.add_pipe(eds.negation())
    ```

    !!! note "Curried components"

        Most components (like `eds.matcher`) require an `nlp` argument at initialization.
        The above `eds.matcher(regex={"smoker": ["fume", "clope"]})` actually returns
        a ["curried"](https://en.wikipedia.org/wiki/Currying) component that will be
        instantiated when added to the pipeline. To create the actual component directly
        and use it outside of a pipeline (not recommended), you can use
        `eds.matcher(nlp, regex={"smoker": ["fume", "clope"]})`, or use the result of
        the `nlp.add_pipe` call.
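
        For instance, a minimal sketch of both options (the `matcher` variable name is ours):

        ```{ .python .no-check }
        # Instantiate the component directly, outside of a pipeline (not recommended)
        matcher = eds.matcher(nlp, regex={"smoker": ["fume", "clope"]})

        # Or retrieve the instantiated component returned by `add_pipe`
        matcher = nlp.add_pipe(eds.matcher(regex={"smoker": ["fume", "clope"]}))
        ```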

=== "spaCy-like API"

    Pipes can be dynamically added to the pipeline using the `add_pipe` method, with a string matching their factory name and an optional configuration dictionary.

    ```python
    import edsnlp  # or import spacy

    nlp = edsnlp.blank("eds")  # or spacy.blank("eds")
    nlp.add_pipe("eds.sentences")
    nlp.add_pipe("eds.matcher", config=dict(regex={"smoker": ["fume", "clope"]}))
    nlp.add_pipe("eds.negation")
    ```

=== "From a YAML config file"

    You can also create a pipeline from a configuration file. This is useful when you plan on changing the pipeline configuration often.

    ```{ .yaml title="config.yml" }
    nlp:
      "@core": pipeline
      lang: eds
      components:
        sentences:
          "@factory": eds.sentences

        matcher:
          "@factory": eds.matcher
          regex:
            smoker: ["fume", "clope"]

        negation:
          "@factory": eds.negation
    ```

    and then load the pipeline with:

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.load("config.yml")
    ```

=== "From an INI config file"

    You can also create a pipeline from a configuration file. This is useful when you plan on changing the pipeline configuration often.

    ```{ .cfg title="config.cfg" }
    [nlp]
    @core = "pipeline"
    lang = "eds"
    pipeline = ["sentences", "matcher", "negation"]

    [components.sentences]
    @factory = "eds.sentences"

    [components.matcher]
    @factory = "eds.matcher"
    regex = {"smoker": ["fume", "clope"]}

    [components.negation]
    @factory = "eds.negation"
    ```

    and then load the pipeline with:

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.load("config.cfg")
    ```

This pipeline can then be run on one or more text documents. As the pipeline processes documents, the components are called in the order they were added to the pipeline.

```{ .python .no-check }
# Processing one document
nlp("Le patient ne fume pas")

# Processing multiple documents
nlp.pipe([text1, text2])
```

For more information on how to use the pipeline, refer to the [Inference](/inference) page.

## Hybrid models

EDS-NLP was designed to facilitate the training and inference of hybrid models that arbitrarily chain static components and trained deep learning components. Static components are callable objects that take a Doc object as input, perform arbitrary transformations on it, and return the modified object. [Torch components][edsnlp.core.torch_component.TorchComponent], on the other hand, allow deep learning operations to be performed on the Doc object and must be trained before use.
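
As an illustration, here is a minimal sketch of a custom static component (the `count_words` function and the `n_words` extension are ours, and we assume `nlp.add_pipe` accepts any Doc-to-Doc callable):

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc

if not Doc.has_extension("n_words"):
    Doc.set_extension("n_words", default=None)

def count_words(doc: Doc) -> Doc:
    # an arbitrary static transformation over the document
    doc._.n_words = len(doc)
    return doc

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(count_words)
```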

<div style="text-align: center" markdown="1">

![Example of a hybrid pipeline](/assets/images/hybrid-pipeline-example.png){: style="height:150px" }

</div>

## Saving and loading a pipeline

Pipelines can be saved and loaded using the `to_disk` and `load` methods. Following spaCy, the saved pipeline is not a pickled object but a folder containing the config file, the weights and extra resources for each pipe. Deep-learning parameters are saved with the `safetensors` library to avoid any security issues. This allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.

```{ .python .no-check }
nlp.to_disk("path/to/your/model")
nlp = edsnlp.load("path/to/your/model")
```

## Sharing a pipeline

To share the pipeline and turn it into a pip-installable package, you can use the `package` method, which will use or create a `pyproject.toml` file, fill it accordingly, and build a wheel file. At the moment, we only support the poetry package manager.

```{ .python .no-check }
nlp.package(
    name="your-package-name",  # leave None to reuse the name in pyproject.toml
    version="0.0.1",
    root_dir="path/to/project/root",  # optional, to retrieve an existing pyproject.toml file
    # if you don't have a pyproject.toml, you can provide the metadata here instead
    metadata=dict(
        authors="Firstname Lastname <your.email@domain.fr>",
        description="A short description of your package",
    ),
)
```

This will create a wheel file in the `root_dir/dist` folder, which you can share and install with pip.