Matching a terminology is perhaps the most basic application of a medical NLP pipeline.
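This tutorial covers matching a list of terms with `eds.matcher`, matching on a specific token attribute (`LOWER` or `NORM`), and matching with regular expressions.
For a detailed description of its options, consider reading the matcher's specific documentation.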
!!! note "Comparison to spaCy's matcher"
    spaCy's `Matcher` and `PhraseMatcher` use a very efficient algorithm that compares a hashed representation token by token. **They are not components** by themselves, but can underpin rule-based pipes.

    EDS-NLP's [`RegexMatcher`][edsnlp.matchers.regex.RegexMatcher] lets the user match entire expressions using regular expressions. To achieve this, the matcher has to get to the text representation, match on it, and get back to spaCy's abstraction.

    The `EDSPhraseMatcher` lets EDS-NLP reuse spaCy's efficient algorithm, while adding the ability to skip pollution tokens (see the [normalizer documentation](../pipes/core/normalizer.md) for detail).
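The round trip described for regular expressions (go to the text, match on it, come back to tokens) can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not EDS-NLP's implementation: the regex-based tokenizer and the helper `char_span_to_token_span` are hypothetical stand-ins introduced for illustration.

```python
import re

text = "probable pneumopathie a COVID19, sans difficultés respiratoires"

# Hypothetical stand-in for a tokenizer: keep each token with its character offsets.
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(start, end):
    """Return (first, stop) indices of the tokens overlapping characters [start, end)."""
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    return covered[0], covered[-1] + 1


# 1. Match on the raw text representation.
match = re.search(r"difficultés\s+respiratoires?", text)

# 2. Map the character span back to token indices.
first, stop = char_span_to_token_span(match.start(), match.end())
matched_tokens = [tok for tok, _, _ in tokens[first:stop]]
print(first, stop, matched_tokens)
```

The expression spans two tokens, which is exactly what a purely token-by-token comparison of single terms cannot express.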
Let's try to find mentions of COVID19 and references to patients within a clinical note.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19"],
respiratoire=["asthmatique", "respiratoire"],
)
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.matcher(terms=terms))
doc = nlp(text)
doc.ents
# Out: (asthmatique,)
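Let's unpack what happened:

1. We defined a dictionary of terms to look for, in the form `{'label': list of terms}`.
2. We created a blank pipeline and added the `eds.matcher` component.
3. We applied the pipeline to the text and inspected the extracted entities in `doc.ents`.

This example showcases a limitation of our term dictionary: the phrases `COVID19` and `difficultés respiratoires` were not detected by the pipeline.

To increase recall, we could just add every possible variation: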
terms = dict(
- covid=["coronavirus", "covid19"],
+ covid=["coronavirus", "covid19", "COVID19"],
- respiratoire=["asthmatique", "respiratoire"],
+ respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
But what if we come across `Coronavirus`? Surely we can do better!
We can modify the matcher's configuration to match on other attributes instead of the verbatim input. You can refer to spaCy's list of available token attributes.
Let's focus on two:

- The `LOWER` attribute, which lets you match on a lowercased version of the text.
- The `NORM` attribute, which adds some basic normalisation (eg `œ` to `oe`). EDS-NLP provides an `eds.normalizer` component that extends the level of cleaning on the `NORM` attribute.

Matching on the `LOWER` attribute, ie the lowercased version of the text, is extremely easy:
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19"],
respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
nlp = edsnlp.blank("eds")
nlp.add_pipe(
eds.matcher(
terms=terms,
attr="LOWER", # (1)
),
)
doc = nlp(text)
doc.ents
# Out: (COVID19, respiratoires, asthmatique)
1. The `attr` parameter defines the attribute that the matcher will use. It is set to `"TEXT"` by default (ie verbatim text).

This code is complete, and should run as is.
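To make the difference between verbatim and lowercased matching concrete, here is a small plain-Python sketch. It only handles single-token terms and is not how the matcher is implemented; the helper `match_single_tokens` and the hand-written token list are assumptions made for illustration.

```python
tokens = ["pneumopathie", "a", "COVID19", ",", "sans", "difficultés", "respiratoires"]
terms = {
    "covid": ["coronavirus", "covid19"],
    "respiratoire": ["respiratoire", "respiratoires"],
}


def match_single_tokens(tokens, terms, key=lambda t: t):
    """Return (label, token) pairs where key(token) equals key(term)."""
    hits = []
    for token in tokens:
        for label, synonyms in terms.items():
            if key(token) in {key(s) for s in synonyms}:
                hits.append((label, token))
    return hits


# Comparing the verbatim text misses "COVID19" ...
verbatim = match_single_tokens(tokens, terms)
# ... while comparing lowercased forms on both sides recovers it.
lowered = match_single_tokens(tokens, terms, key=str.lower)
print(verbatim)
print(lowered)
```

Note that the same transformation is applied to the tokens and to the terms, so the comparison stays symmetric.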
EDS-NLP provides its own normalisation component, which modifies the `NORM` attribute in place.
Among other things, it handles lowercasing, the removal of accents, and the tagging of pollution tokens.
!!! note "Pollution in clinical texts"
    EDS-NLP is meant to be deployed on clinical reports extracted from
    hospital information systems. As such, these reports are often riddled with
    extraction issues or administrative artifacts that "pollute" the text.

    As a core principle, EDS-NLP **never modifies the input text**,
    and `#!python nlp(text).text == text` is **always true**.
    However, we can tag some tokens as pollution elements,
    and avoid using them for matching the terminology.
You can activate it like any other component.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a ===== COVID19, " # (1)
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
terms = dict(
covid=["coronavirus", "covid19", "pneumopathie à covid19"], # (2)
respiratoire=["asthmatique", "respiratoire", "respiratoires"],
)
nlp = edsnlp.blank("eds")
# Add the normalisation component
nlp.add_pipe(eds.normalizer()) # (3)
nlp.add_pipe(
eds.matcher(
terms=terms,
attr="NORM", # (4)
ignore_excluded=True, # (5)
),
)
doc = nlp(text)
doc.ents
# Out: (pneumopathie a ===== COVID19, respiratoires, asthmatique)
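1. The example now includes a simple pollution (`=====`).
2. We added `pneumopathie à covid19` to the list of synonyms detected by the pipeline. Note that the synonym keeps the accented `à`, whereas the example text contains an unaccented `a`.
3. The normalisation component is added like any other component.
4. The normalisation lives in the `NORM` attribute.
5. `ignore_excluded=True` tells the matcher to skip tokens tagged as pollution.

Using the normalisation component, you can match on a normalised version of the text,
as well as skip pollution tokens during the matching process.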
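Skipping pollution during matching, while still reporting a span over the untouched text, can be sketched as follows. This is an illustrative plain-Python version, not the library's algorithm: the function `find_phrase` and the boolean `excluded` list are hypothetical stand-ins for the tags set by the normalisation component.

```python
def find_phrase(tokens, excluded, phrase):
    """Find `phrase` (a list of tokens) while skipping excluded tokens.

    Returns (start, end) indices into `tokens` (end exclusive), or None.
    """
    # Indices of the tokens that take part in matching.
    kept = [i for i, is_excluded in enumerate(excluded) if not is_excluded]
    kept_tokens = [tokens[i] for i in kept]
    for offset in range(len(kept_tokens) - len(phrase) + 1):
        if kept_tokens[offset:offset + len(phrase)] == phrase:
            # Map back to positions in the original, unmodified token list.
            return kept[offset], kept[offset + len(phrase) - 1] + 1
    return None


tokens = ["pneumopathie", "a", "=====", "covid19", ","]
excluded = [False, False, True, False, False]  # "=====" tagged as pollution
phrase = ["pneumopathie", "a", "covid19"]

span = find_phrase(tokens, excluded, phrase)
no_skip = find_phrase(tokens, [False] * len(tokens), phrase)
print(tokens[span[0]:span[1]], no_skip)
```

The returned span still contains the `=====` token: the input is never altered, the pollution is only ignored while comparing.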
!!! tip "Using term matching with the normalisation"
    If you use the term matcher with the normalisation, bear in mind that the **terms themselves go through the pipeline**.
    That's how the matcher was able to recover `pneumopathie a ===== COVID19` despite the fact that
    we used an accented `à` in the terminology.

    The term matcher matches the input text to the provided terminology, using the selected attribute in both cases.
    The `NORM` attribute that corresponds to `à` and `a` is the same: `a`.
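The idea that both the text and the terms are normalised before comparison can be sketched with the standard library alone. The function `simple_norm` below is a rough, hypothetical approximation (lowercasing plus accent stripping); the actual normalisation component does more and is not implemented this way.

```python
import unicodedata


def simple_norm(text):
    """Lowercase and strip combining accents (rough stand-in for NORM)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


term = "pneumopathie à covid19"     # as written in the terminology
mention = "pneumopathie a COVID19"  # as found in the clinical note

# Both sides go through the same normalisation before being compared.
same = simple_norm(term) == simple_norm(mention)
print(simple_norm(term), same)
```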
We have matched all mentions! However, we had to spell out the singular and plural forms of `respiratoire`...
And what if we wanted to detect `covid 19`, or `covid-19`?
Of course, we could write out every imaginable possibility, but this will quickly become tedious.
Let us redefine the pipeline once again, this time using regular expressions, which can express richer patterns in more compact queries.
import edsnlp, edsnlp.pipes as eds
text = (
"Motif de prise en charge : probable pneumopathie a COVID19, "
"sans difficultés respiratoires\n"
"Le père du patient est asthmatique."
)
regex = dict(
covid=r"(coronavirus|covid[-\s]?19)",
respiratoire=r"respiratoires?",
)
terms = dict(respiratoire="asthmatique")
nlp = edsnlp.blank("eds")
nlp.add_pipe(
eds.matcher(
regex=regex, # (1)
terms=terms, # (2)
attr="LOWER", # (3)
),
)
doc = nlp(text)
doc.ents
# Out: (COVID19, respiratoires, asthmatique)
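The `covid` pattern used above can be checked in isolation with Python's standard `re` module. This is only a sanity check of the regular expression, not of the EDS-NLP matcher: `re.IGNORECASE` stands in for `attr="LOWER"`, and the candidate strings are illustrative.

```python
import re

# Same pattern as in the pipeline; IGNORECASE is an assumption standing in for attr="LOWER".
pattern = re.compile(r"(coronavirus|covid[-\s]?19)", flags=re.IGNORECASE)

candidates = ["COVID19", "covid 19", "covid-19", "Coronavirus", "covid2019"]

# Keep only the candidates that the pattern matches in full.
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)
```

A single pattern covers the spacing and hyphenation variants that would otherwise each need their own entry in the term dictionary.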
To visualize extracted entities, check out the Visualization tutorial.