# Matching a terminology

Matching a terminology is perhaps the most basic application of a medical NLP pipeline.

In this tutorial, we will cover:

- Matching a terminology using spaCy's matchers, as well as regular expressions
- Matching on a specific attribute

You should consider reading the [matcher's specific documentation](../pipes/core/matcher.md) for a description.

!!! note "Comparison to spaCy's matcher"

    spaCy's `Matcher` and `PhraseMatcher` use a very efficient algorithm that compares a hashed representation token by token. **They are not components** by themselves, but can underpin rule-based pipes.

    EDS-NLP's [`RegexMatcher`][edsnlp.matchers.regex.RegexMatcher] lets the user match entire expressions using regular expressions. To achieve this, the matcher has to get to the text representation, match on it, and get back to spaCy's abstraction.

    The `EDSPhraseMatcher` lets EDS-NLP reuse spaCy's efficient algorithm, while adding the ability to skip pollution tokens (see the [normalizer documentation](../pipes/core/normalizer.md) for details).

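The round trip described for the `RegexMatcher` (from tokens to text, a regex match on the text, then back to tokens) can be sketched in plain Python. This is only an illustration under simplifying assumptions: the regex-based tokenizer, the `tokens` list and the `char_span_to_token_span` helper are hypothetical names introduced here, and do not reflect how EDS-NLP implements it.

```python
import re

text = "probable pneumopathie a COVID-19, sans difficultés respiratoires"

# Hypothetical tokenizer: words and punctuation marks, with character offsets.
tokens = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r"\w+|[^\w\s]", text)
]


def char_span_to_token_span(start, end):
    """Return the (first, last_exclusive) indices of tokens overlapping [start, end)."""
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    return covered[0], covered[-1] + 1


# 1. Match on the raw text...
match = re.search(r"covid[-\s]?19", text, flags=re.IGNORECASE)

# 2. ...then map the character span back to token indices.
first, last = char_span_to_token_span(match.start(), match.end())
matched_tokens = [tok for tok, _, _ in tokens[first:last]]

print(matched_tokens)
# Out: ['COVID', '-', '19']
```

The regular expression never sees token boundaries: it is the mapping step that turns a character span into a span of tokens.
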
## A simple use case: finding COVID19

Let's try to find mentions of COVID19 and of respiratory terms within a clinical note.

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.matcher(terms=terms))

doc = nlp(text)

doc.ents
# Out: (asthmatique,)
```

Let's unpack what happened:

1. We defined a dictionary of terms to look for, in the form `{'label': list of terms}`.
2. We declared a spaCy pipeline, and added the `eds.matcher` component.
3. We applied the pipeline to the text...
4. ... and explored the extracted entities.

This example showcases a limitation of our term dictionary: the phrases `COVID19` and `difficultés respiratoires` were not detected by the pipeline.

To increase recall, we _could_ just add every possible variation:

```diff
  terms = dict(
-     covid=["coronavirus", "covid19"],
+     covid=["coronavirus", "covid19", "COVID19"],
-     respiratoire=["asthmatique", "respiratoire"],
+     respiratoire=["asthmatique", "respiratoire", "respiratoires"],
  )
```

But what if we come across `Coronavirus`? Surely we can do better!

## Matching on normalised text

We can modify the matcher's configuration to match on other attributes instead of the verbatim input. You can refer to spaCy's [list of available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes){ target=\_blank}.

Let's focus on two:

1. The `LOWER` attribute, which lets you match on a lowercased version of the text.
2. The `NORM` attribute, which adds some basic normalisation (e.g. `œ` to `oe`). EDS-NLP provides an `eds.normalizer` component that extends the level of cleaning on the `NORM` attribute.

### The `LOWER` attribute

Matching on the lowercased version is extremely easy:

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19"],
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        terms=terms,
        attr="LOWER",  # (1)
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)
```

1. The matcher's `attr` parameter defines the attribute that the matcher will use. It is set to `"TEXT"` by default (i.e. verbatim text).

This code is complete, and should run as is.

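To see why switching the attribute changes the result, here is a minimal pure-Python sketch of attribute-based matching. It is not the EDS-NLP implementation: `match_terms` and its `key` argument are hypothetical names, the tokenization is naive, and only single-token terms are handled. The idea is that the same transformation is applied to both the terms and the tokens before they are compared.

```python
text = "probable pneumopathie a COVID19, sans difficultés respiratoires"
terms = ["coronavirus", "covid19"]

# Naive tokenization: detach the comma, then split on whitespace.
tokens = text.replace(",", " ,").split()


def match_terms(tokens, terms, key=lambda t: t):
    """Return the tokens whose `key` representation equals that of a term."""
    wanted = {key(term) for term in terms}
    return [tok for tok in tokens if key(tok) in wanted]


verbatim = match_terms(tokens, terms)  # compares the verbatim text
lowered = match_terms(tokens, terms, key=str.lower)  # compares lowercased text

print(verbatim)
# Out: []
print(lowered)
# Out: ['COVID19']
```

Note that the match is still reported with its original casing: only the comparison uses the lowercased form.
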
### Using the normalisation component

EDS-NLP provides its own normalisation component, which modifies the `NORM` attribute in place.
It handles:

- removal of accents;
- normalisation of quotes and apostrophes;
- lowercasing, which is enabled by default in spaCy (EDS-NLP lets you disable it);
- removal of pollution.

!!! note "Pollution in clinical texts"

    EDS-NLP is meant to be deployed on clinical reports extracted from
    hospital information systems. As such, these reports are often riddled with
    extraction issues or administrative artifacts that "pollute" the
    text.

    As a core principle, EDS-NLP **never modifies the input text**,
    and `#!python nlp(text).text == text` is **always true**.
    However, we can tag some tokens as pollution elements,
    and avoid using them for matching the terminology.

You can activate it like any other component.

```python hl_lines="4 10 17 22 23"
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a ===== COVID19, "  # (1)
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

terms = dict(
    covid=["coronavirus", "covid19", "pneumopathie à covid19"],  # (2)
    respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)

nlp = edsnlp.blank("eds")

# Add the normalisation component
nlp.add_pipe(eds.normalizer())  # (3)

nlp.add_pipe(
    eds.matcher(
        terms=terms,
        attr="NORM",  # (4)
        ignore_excluded=True,  # (5)
    ),
)

doc = nlp(text)

doc.ents
# Out: (pneumopathie a ===== COVID19, respiratoires, asthmatique)
```

1. We've modified the example to include a simple pollution.
2. We've added `pneumopathie à covid19` to the list of synonyms detected by the pipeline.
    Note that in the synonym we provide, we kept the accented `à`, whereas the example
    displays an unaccented `a`.
3. The component can be configured. See the [specific documentation](../pipes/core/normalizer.md) for details.
4. The normalisation lives in the `NORM` attribute.
5. We can tell the matcher to ignore excluded tokens (tokens tagged as pollution by the normalisation component).
    This is optional.

Using the normalisation component, you can match on a normalised version of the text,
as well as **skip pollution tokens during the matching process**.

!!! tip "Using term matching with the normalisation"

    If you use the term matcher with the normalisation, bear in mind that the **provided terms also go through the pipeline**.
    That's how the matcher was able to recover `pneumopathie a ===== COVID19` despite the fact that
    we used an accented `à` in the terminology.

    The term matcher matches the input text to the provided terminology, using the selected attribute in both cases.
    The `NORM` attribute that corresponds to `à` and `a` is the same: `a`.

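The two ideas above (normalising both the terms and the text, and skipping pollution tokens) can be illustrated with a small standard-library sketch. It makes several assumptions that are not part of EDS-NLP: `norm` is a rough stand-in for the `NORM` attribute (lowercasing plus accent removal), `is_pollution` is a hypothetical rule that flags runs of `=` signs, and tokens are obtained by splitting on whitespace.

```python
import unicodedata


def norm(token):
    """Lowercase and strip accents: a rough stand-in for a NORM attribute."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_pollution(token):
    """Hypothetical pollution rule: the token is a run of '=' signs."""
    return set(token) == {"="}


def find(term, text):
    """Return the verbatim tokens matching `term`, skipping pollution tokens."""
    tokens = text.split()
    # Indices of the tokens that take part in the matching.
    kept = [i for i, tok in enumerate(tokens) if not is_pollution(tok)]
    normed = [norm(tokens[i]) for i in kept]
    # The term goes through the same normalisation as the text.
    target = [norm(tok) for tok in term.split()]
    for start in range(len(normed) - len(target) + 1):
        if normed[start : start + len(target)] == target:
            first, last = kept[start], kept[start + len(target) - 1]
            return tokens[first : last + 1]  # the input text is left untouched
    return None


result = find(
    "pneumopathie à covid19",
    "probable pneumopathie a ===== COVID19 sans fièvre",
)
print(result)
# Out: ['pneumopathie', 'a', '=====', 'COVID19']
```

The pollution token is ignored during the comparison but still belongs to the returned span, which mirrors the `pneumopathie a ===== COVID19` entity shown in the tutorial output.
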
### Preliminary conclusion

We have matched all mentions! However, we had to spell out the singular and plural form of `respiratoire`...
And what if we wanted to detect `covid 19`, or `covid-19`?
Of course, we _could_ write out every imaginable possibility, but this will quickly become tedious.

## Using regular expressions

Let us redefine the pipeline once again, this time using regular expressions. Regular expressions let us define richer patterns with more compact queries.

```python
import edsnlp, edsnlp.pipes as eds

text = (
    "Motif de prise en charge : probable pneumopathie a COVID19, "
    "sans difficultés respiratoires\n"
    "Le père du patient est asthmatique."
)

regex = dict(
    covid=r"(coronavirus|covid[-\s]?19)",
    respiratoire=r"respiratoires?",
)
terms = dict(respiratoire="asthmatique")

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.matcher(
        regex=regex,  # (1)
        terms=terms,  # (2)
        attr="LOWER",  # (3)
    ),
)

doc = nlp(text)

doc.ents
# Out: (COVID19, respiratoires, asthmatique)
```

1. We can now match using regular expressions.
2. We can mix and match patterns! Here we keep looking for `asthmatique` using spaCy's term matching.
3. Regular expression matching is not limited to the verbatim text! You can choose to use one of spaCy's native attributes, ignore excluded tokens, etc.

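To get a feel for what the two patterns cover, they can be tried on a few candidate strings with Python's standard `re` module. This is a sketch outside of EDS-NLP: the candidate lists are made up for illustration, and `re.IGNORECASE` is used here as a rough substitute for matching on the `LOWER` attribute.

```python
import re

covid = re.compile(r"(coronavirus|covid[-\s]?19)", flags=re.IGNORECASE)
respiratoire = re.compile(r"respiratoires?", flags=re.IGNORECASE)

# Hypothetical spelling variants that a clinical note might contain.
covid_candidates = ["COVID19", "covid 19", "Covid-19", "Coronavirus", "covid_19"]
resp_candidates = ["respiratoire", "respiratoires", "respiration"]

covid_matched = [c for c in covid_candidates if covid.fullmatch(c)]
resp_matched = [c for c in resp_candidates if respiratoire.fullmatch(c)]

print(covid_matched)
# Out: ['COVID19', 'covid 19', 'Covid-19', 'Coronavirus']
print(resp_matched)
# Out: ['respiratoire', 'respiratoires']
```

A single pattern replaces the list of variations we would otherwise have to enumerate, while spellings outside the pattern (such as `covid_19`) are still missed.
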
To visualize extracted entities, check out the [Visualization](/tutorials/visualization) tutorial.