a b/docs/tutorials/matching-a-terminology.md

# Matching a terminology

Matching a terminology is perhaps the most basic application of a medical NLP pipeline.

In this tutorial, we will cover:

- Matching a terminology using spaCy's matchers, as well as RegExps
- Matching on a specific attribute

You should consider reading the [matcher's specific documentation](../pipes/core/matcher.md) for a full description.

!!! note "Comparison to spaCy's matcher"

    spaCy's `Matcher` and `PhraseMatcher` use a very efficient algorithm that compares a hashed representation token by token. **They are not components** by themselves, but can underpin rule-based pipes.

    EDS-NLP's [`RegexMatcher`][edsnlp.matchers.regex.RegexMatcher] lets the user match entire expressions using regular expressions. To achieve this, the matcher has to get to the text representation, match on it, and get back to spaCy's abstraction.

    The `EDSPhraseMatcher` lets EDS-NLP reuse spaCy's efficient algorithm, while adding the ability to skip pollution tokens (see the [normalizer documentation](../pipes/core/normalizer.md) for details).
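
The round trip described for the `RegexMatcher` (text, then match, then back to tokens) can be sketched in plain Python. This is only an illustration of the idea, not EDS-NLP's implementation: `tokenize` and `regex_to_token_span` are hypothetical helpers, and the tokenizer is a deliberately naive stand-in for spaCy's.

```python
import re


def tokenize(text):
    """Naive stand-in for a tokenizer: returns (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def regex_to_token_span(text, pattern):
    """Match on the raw text, then map character offsets back to token indices."""
    tokens = tokenize(text)
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Keep every token that overlaps the matched character range.
    covered = [
        i
        for i, (_, start, end) in enumerate(tokens)
        if start < match.end() and end > match.start()
    ]
    return covered[0], covered[-1] + 1  # half-open token span


text = "probable pneumopathie a COVID 19, sans difficultés respiratoires"
span = regex_to_token_span(text, r"covid[-\s]?19")
matched_tokens = [tok for tok, _, _ in tokenize(text)[span[0]:span[1]]]
print(span, matched_tokens)
```

Here the expression `COVID 19` spans two tokens, which a token-by-token comparison against the single term `covid19` would not find.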

## A simple use case: finding COVID19

Let's try to find mentions of COVID19 and references to patients within a clinical note.

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.matcher(terms=terms))

doc = nlp(text)

doc.ents
# Out: (asthmatique,)
```

Let's unpack what happened:

1. We defined a dictionary of terms to look for, in the form `{'label': list of terms}`.
2. We declared a pipeline with `edsnlp.blank`, and added the `eds.matcher` component.
3. We applied the pipeline to the text...
4. ... and explored the extracted entities.

This example showcases a limitation of our term dictionary: the phrases `COVID19` and `difficultés respiratoires` were not detected by the pipeline.
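
To see why these mentions were missed, here is a minimal sketch in plain Python of what a verbatim comparison amounts to. It does not use EDS-NLP, and the word splitting with `re.findall` is a naive assumption made for illustration only.

```python
import re

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)
terms = {"coronavirus", "covid19", "asthmatique", "respiratoire"}

# Naive word splitting, standing in for a real tokenizer.
words = re.findall(r"\w+", text)

# Verbatim comparison: a word is found only if it equals a term exactly.
found = [word for word in words if word in terms]
print(found)  # ['asthmatique']

print("COVID19" == "covid19")  # False: the case differs
print("respiratoires" == "respiratoire")  # False: the plural differs
```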

To increase recall, we _could_ just add every possible variation:

```diff
terms = dict(
-    covid=["coronavirus", "covid19"],
+    covid=["coronavirus", "covid19", "COVID19"],
-    respiratoire=["asthmatique", "respiratoire"],
+    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
```

But what if we come across `Coronavirus`? Surely we can do better!

## Matching on normalised text

We can modify the matcher's configuration to match on other attributes instead of the verbatim input. You can refer to spaCy's [list of available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes){ target=\_blank}.

Let's focus on two:

1. The `LOWER` attribute, which lets you match on a lowercased version of the text.
2. The `NORM` attribute, which adds some basic normalisation (e.g. `œ` to `oe`). EDS-NLP provides an `eds.normalizer` component that extends the level of cleaning on the `NORM` attribute.

### The `LOWER` attribute

Matching on the lowercased version is extremely easy:

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        terms=terms,
        attr="LOWER",  # (1)
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)
```

1. The matcher's `attr` parameter defines the attribute that the matcher will use. It is set to `"TEXT"` by default (i.e. verbatim text).

This code is complete, and should run as is.

### Using the normalisation component

EDS-NLP provides its own normalisation component, which modifies the `NORM` attribute in place.
It handles:

- removal of accents;
- normalisation of quotes and apostrophes;
- lowercasing, which is enabled by default in spaCy (EDS-NLP lets you disable it);
- removal of pollution.

!!! note "Pollution in clinical texts"

    EDS-NLP is meant to be deployed on clinical reports extracted from
    hospital information systems. As such, these reports are often riddled with
    extraction issues or administrative artifacts that "pollute" them.

    As a core principle, EDS-NLP **never modifies the input text**,
    and `#!python nlp(text).text == text` is **always true**.
    However, we can tag some tokens as pollution elements,
    and avoid using them for matching the terminology.

You can activate it like any other component.

```python hl_lines="4 10 17 22 23"
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a ===== COVID19, "  # (1)
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19", "pneumopathie à covid19"],  # (2)
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")

# Add the normalisation component
nlp.add_pipe(eds.normalizer())  # (3)

nlp.add_pipe(
    eds.matcher(
        terms=terms,
        attr="NORM",  # (4)
        ignore_excluded=True,  # (5)
    ),
)

doc = nlp(text)

doc.ents
# Out: (pneumopathie a ===== COVID19, respiratoires, asthmatique)
```

1. We've modified the example to include a simple pollution.
2. We've added `pneumopathie à covid19` to the list of synonyms detected by the pipeline.
   Note that in the synonym we provide, we kept the accented `à`, whereas the example
   displays an unaccented `a`.
3. The component can be configured. See the [specific documentation](../pipes/core/normalizer.md) for details.
4. The normalisation lives in the `NORM` attribute.
5. We can tell the matcher to ignore excluded tokens (tokens tagged as pollution by the normalisation component).
   This is optional.

Using the normalisation component, you can match on a normalised version of the text,
as well as **skip pollution tokens during the matching process**.

!!! tip "Using term matching with the normalisation"

    If you use the term matcher with the normalisation, bear in mind that the **terms you provide also go through the pipeline**.
    That's how the matcher was able to recover `pneumopathie a ===== COVID19` despite the fact that
    we used an accented `à` in the terminology.

    The term matcher matches the input text to the provided terminology, using the selected attribute in both cases.
    The `NORM` attribute that corresponds to `à` and `a` is the same: `a`.
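
The behaviour described in this tip can be sketched in plain Python. This is not EDS-NLP's implementation: `normalise`, `is_pollution` and `match_term` are hypothetical helpers, and the pollution rule (a token made only of `=` signs) is an assumption chosen to fit the example. The point is that the term and the text go through the same normalisation before being compared, and that excluded tokens are skipped while the original text stays untouched.

```python
import unicodedata


def normalise(token):
    """Lowercase and strip accents (a rough stand-in for the NORM attribute)."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_pollution(token):
    """Assumed pollution rule: a token made only of '=' characters."""
    return token != "" and set(token) == {"="}


def match_term(text_tokens, term_tokens):
    """Return (start, end) token indices of the first match, skipping pollution."""
    # Remember the original index of every non-pollution token.
    kept = [(i, normalise(t)) for i, t in enumerate(text_tokens) if not is_pollution(t)]
    target = [normalise(t) for t in term_tokens]
    for pos in range(len(kept) - len(target) + 1):
        window = kept[pos:pos + len(target)]
        if [norm for _, norm in window] == target:
            return window[0][0], window[-1][0] + 1
    return None


text_tokens = "probable pneumopathie a ===== COVID19".split()
term_tokens = "pneumopathie à covid19".split()

span = match_term(text_tokens, term_tokens)
matched = text_tokens[span[0]:span[1]]
print(matched)
```

The returned span is expressed in original token indices, so the pollution token is still part of the matched text even though it was ignored during the comparison.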

### Preliminary conclusion

We have matched all mentions! However, we had to spell out the singular and plural forms of `respiratoire`...
And what if we wanted to detect `covid 19`, or `covid-19`?
Of course, we _could_ write out every imaginable possibility, but this will quickly become tedious.

## Using regular expressions

Let us redefine the pipeline once again, this time using regular expressions, which let us define richer patterns with more compact queries.

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

regex = dict(
    covid=r"(coronavirus|covid[-\s]?19)",
    respiratoire=r"respiratoires?",
)
terms = dict(respiratoire="asthmatique")

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        regex=regex,  # (1)
        terms=terms,  # (2)
        attr="LOWER",  # (3)
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)
```

1. We can now match using regular expressions.
2. We can mix and match patterns! Here we keep looking for patients using spaCy's term matching.
3. RegExp matching is not limited to the verbatim text! You can choose to use one of spaCy's native attributes, ignore excluded tokens, etc.
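
The patterns themselves can be checked with Python's standard `re` module before plugging them into the pipeline. The sketch below is independent of EDS-NLP: it uses `re.IGNORECASE` to approximate matching on the `LOWER` attribute, and the list of variants is made up for illustration.

```python
import re

regex = dict(
    covid=r"(coronavirus|covid[-\s]?19)",
    respiratoire=r"respiratoires?",
)

# Hypothetical spelling variants we would like the patterns to cover.
variants = [
    "COVID19",
    "covid 19",
    "covid-19",
    "Coronavirus",
    "respiratoire",
    "respiratoires",
    "asthmatique",
]

labels = {}
for variant in variants:
    # Collect every label whose pattern matches the whole variant.
    labels[variant] = [
        label
        for label, pattern in regex.items()
        if re.fullmatch(pattern, variant, flags=re.IGNORECASE)
    ]
    print(f"{variant!r}: {labels[variant]}")
```

`asthmatique` matches neither pattern, which is why the pipeline above keeps it in `terms`.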

To visualize extracted entities, check out the [Visualization](/tutorials/visualization) tutorial.