# Pipeline {: #edsnlp.core.pipeline.Pipeline }

The goal of EDS-NLP is to provide a **framework** for processing textual documents.

Processing textual documents, and clinical documents in particular, usually involves many steps such as tokenization, cleaning, named entity recognition, span classification, normalization and linking. Organising these steps together and combining static and deep learning components, while remaining modular and efficient, is a challenge. This is why EDS-NLP is built on top of a **novel pipelining system**.

!!! note "Deep learning frameworks"

    Trainable components in EDS-NLP are built around the PyTorch framework. While you
    can use any technology in static components, we do not provide tools to train
    components built with other deep learning frameworks.

## Compatibility with spaCy and PyTorch

While EDS-NLP is built on top of its own pipeline system, it is also designed to be compatible with the awesome [spaCy](https://spacy.io) framework. This means that you can use (non-trainable) EDS-NLP components in a spaCy pipeline, and vice versa. Documents, i.e. the objects that are passed through the pipeline, are in fact spaCy documents, and we borrow many of spaCy's method names and conventions to make the transition between the two libraries as smooth as possible.

Trainable components, on the other hand, are built on top of the [PyTorch](https://pytorch.org) framework. This means that you can use PyTorch components in an EDS-NLP pipeline and benefit from the latest advances in deep learning research. For more information on PyTorch components, refer to the [Torch component](../torch-component) page.

## Creating a pipeline

A pipeline is composed of multiple pipes, i.e. callable processing blocks that, like functions, apply a transformation to a Doc object (such as adding annotations) and return the modified object.

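To make the idea of a pipe concrete, here is a minimal sketch in plain Python. It deliberately avoids EDS-NLP and spaCy: `ToyDoc`, `add_sentences`, `flag_smoker` and `run_pipeline` are hypothetical names used only for illustration, and a real Doc object is far richer than this toy class.

```python
class ToyDoc:
    """Hypothetical stand-in for a Doc: a text plus a bag of annotations."""

    def __init__(self, text):
        self.text = text
        self.annotations = {}


def add_sentences(doc):
    # Naive sentence splitting on full stops, for illustration only.
    parts = doc.text.split(".")
    doc.annotations["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc


def flag_smoker(doc):
    # Annotate the document if one of the trigger words appears in the text.
    doc.annotations["smoker"] = any(w in doc.text for w in ("fume", "clope"))
    return doc


def run_pipeline(pipes, text):
    # Each pipe receives the document returned by the previous one.
    doc = ToyDoc(text)
    for pipe in pipes:
        doc = pipe(doc)
    return doc


doc = run_pipeline([add_sentences, flag_smoker], "Le patient ne fume pas. Il tousse.")
print(doc.annotations)
```

Each pipe takes the document, enriches it and hands it back, so pipes can be chained in any order that respects their dependencies.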
EDS-NLP provides several ways to create your first pipeline:

=== "EDS-NLP API"

    This is the recommended way to create a pipeline, as it allows auto-completion, type checking and introspection (you can click on the component or its arguments to see the documentation in most IDEs).

    ```python
    import edsnlp, edsnlp.pipes as eds

    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    nlp.add_pipe(eds.matcher(regex={"smoker": ["fume", "clope"]}))
    nlp.add_pipe(eds.negation())
    ```

    !!! note "Curried components"

        Most components (like `eds.matcher`) require an `nlp` argument at initialization.
        The above `eds.matcher(regex={"smoker": ["fume", "clope"]})` actually returns
        a ["curried"](https://en.wikipedia.org/wiki/Currying) component, which will be
        instantiated when added to the pipeline. To create the actual component directly
        and use it outside of a pipeline (not recommended), you can use
        `eds.matcher(nlp, regex={"smoker": ["fume", "clope"]})`, or use the result of
        the `nlp.add_pipe` call.

=== "SpaCy-like API"

    Pipes can be dynamically added to the pipeline using the `add_pipe` method, with a string matching their factory name and an optional configuration dictionary.

    ```python
    import edsnlp  # or import spacy

    nlp = edsnlp.blank("eds")  # or spacy.blank("eds")
    nlp.add_pipe("eds.sentences")
    nlp.add_pipe("eds.matcher", config=dict(regex={"smoker": ["fume", "clope"]}))
    nlp.add_pipe("eds.negation")
    ```

=== "From a YAML config file"

    You can also create a pipeline from a configuration file. This is useful when you plan on changing the pipeline configuration often.

    ```{ .yaml title="config.yml" }
    nlp:
      "@core": pipeline
      lang: eds
      components:
        sentences:
          "@factory": eds.sentences

        matcher:
          "@factory": eds.matcher
          regex:
            smoker: ["fume", "clope"]

        negation:
          "@factory": eds.negation
    ```

    and then load the pipeline with:

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.load("config.yml")
    ```

=== "From an INI config file"

    You can also create a pipeline from a configuration file. This is useful when you plan on changing the pipeline configuration often.

    ```{ .cfg title="config.cfg" }
    [nlp]
    @core = "pipeline"
    lang = "eds"
    pipeline = ["sentences", "matcher", "negation"]

    [components.sentences]
    @factory = "eds.sentences"

    [components.matcher]
    @factory = "eds.matcher"
    regex = {"smoker": ["fume", "clope"]}

    [components.negation]
    @factory = "eds.negation"
    ```

    and then load the pipeline with:

    ```{ .python .no-check }
    import edsnlp

    nlp = edsnlp.load("config.cfg")
    ```

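Both configuration formats describe the same thing: for each component, a factory name under `@factory` plus its keyword arguments. The sketch below shows one way such an entry could be turned into an object, using a plain dictionary as a registry. It only illustrates the idea and is not EDS-NLP's actual loader; `FACTORIES`, `ToySentences`, `ToyMatcher` and `build_components` are hypothetical names.

```python
class ToySentences:
    """Hypothetical component without parameters."""


class ToyMatcher:
    """Hypothetical component configured with regex patterns."""

    def __init__(self, regex):
        self.regex = regex


# Hypothetical registry mapping factory names to constructors.
FACTORIES = {"toy.sentences": ToySentences, "toy.matcher": ToyMatcher}

# Same shape as the `components` section of the config files above.
config = {
    "sentences": {"@factory": "toy.sentences"},
    "matcher": {"@factory": "toy.matcher", "regex": {"smoker": ["fume", "clope"]}},
}


def build_components(components_config):
    built = {}
    for name, params in components_config.items():
        params = dict(params)  # copy, so the config itself is left untouched
        factory = FACTORIES[params.pop("@factory")]
        built[name] = factory(**params)  # remaining keys become keyword arguments
    return built


components = build_components(config)
print(list(components))  # component names, in declaration order
```

Keeping the factory name separate from its arguments is what lets a configuration file be edited without touching any Python code.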
Whichever way it was created, the pipeline can then be run on one or more text documents.
As the pipeline processes documents, components are called in the order
they were added to the pipeline.

```{ .python .no-check }
# Processing one document
nlp("Le patient ne fume pas")

# Processing multiple documents
text1 = "Le patient ne fume pas"
text2 = "Le patient fume"
nlp.pipe([text1, text2])
```

For more information on how to use the pipeline, refer to the [Inference](/inference) page.

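The "curried" components mentioned in the note above can be pictured with partial application in plain Python: the keyword arguments are recorded first, and the component is only built once the pipeline supplies itself as `nlp`. This is a sketch of the idea using `functools.partial`, not EDS-NLP's implementation; `ToyPipeline` and `ToyMatcher` are hypothetical names.

```python
from functools import partial


class ToyPipeline:
    """Hypothetical pipeline that finishes building curried components."""

    def __init__(self):
        self.pipes = []

    def add_pipe(self, curried):
        # Supply the missing `nlp` argument, which creates the actual component.
        component = curried(nlp=self)
        self.pipes.append(component)
        return component


class ToyMatcher:
    """Hypothetical component that needs the pipeline at initialization."""

    def __init__(self, nlp, regex):
        self.nlp = nlp
        self.regex = regex


# No component exists yet: `partial` only records the class and `regex`.
curried = partial(ToyMatcher, regex={"smoker": ["fume", "clope"]})

nlp = ToyPipeline()
matcher = nlp.add_pipe(curried)
print(type(matcher).__name__, matcher.nlp is nlp)
```

Deferring construction this way means a component never has to be told twice which pipeline it belongs to.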
## Hybrid models

EDS-NLP was designed to facilitate the training and inference of hybrid models that arbitrarily chain static components and trained deep learning components. Static components are callable objects that take a Doc object as input, perform arbitrary transformations over the input, and return the modified object. [Torch components][edsnlp.core.torch_component.TorchComponent], on the other hand, allow for deep learning operations to be performed on the Doc object and must be trained to be used.

## Saving and loading a pipeline

Pipelines can be saved with the `to_disk` method and loaded back with `edsnlp.load`. Following spaCy, the saved pipeline is not a pickled object but a folder containing the config file, the weights and extra resources for each pipe. Deep-learning parameters are saved with the `safetensors` library to avoid any security issue. This layout allows for easy inspection and modification of the pipeline, and avoids the execution of arbitrary code when loading a pipeline.

```{ .python .no-check }
import edsnlp

nlp.to_disk("path/to/your/model")
nlp = edsnlp.load("path/to/your/model")
```

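To see why avoiding pickle matters, the following sketch uses only the Python standard library. It shows that unpickling can call a function chosen by whoever produced the file, whereas a data-only format can only yield plain values. JSON stands in for `safetensors` here, an assumption made to keep the example free of dependencies, and `Payload` is a hypothetical, harmless example.

```python
import json
import pickle


class Payload:
    """Hypothetical object whose pickle triggers a function call on load."""

    def __reduce__(self):
        # pickle will call eval("1 + 1") when the bytes are loaded.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())
restored = pickle.loads(blob)  # runs the expression instead of rebuilding Payload
print(restored)

# A data-only format has no such hook: loading yields plain values only.
weights = json.loads('{"layer.weight": [0.1, 0.2]}')
print(weights)
```

The expression here is harmless, but the same mechanism lets a pickle from an untrusted source run any code on load.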
## Sharing a pipeline

To share the pipeline and turn it into a pip-installable package, you can use the `package` method, which will use or create a `pyproject.toml` file, fill it accordingly, and create a wheel file. At the moment, we only support the Poetry package manager.

```{ .python .no-check }
nlp.package(
    name="your-package-name",  # leave None to reuse name in pyproject.toml
    version="0.0.1",
    root_dir="path/to/project/root",  # optional, to retrieve an existing pyproject.toml file
    # if you don't have a pyproject.toml, you can provide the metadata here instead
    metadata=dict(
        authors="Firstname Lastname <your.email@domain.fr>",
        description="A short description of your package",
    ),
)
```

This will create a wheel file in the `root_dir/dist` folder, which you can share and install with pip.