# Matchers

We implemented three pattern matchers that are well suited to clinical documents:

- the `EDSPhraseMatcher`
- the `RegexMatcher`
- the `SimstringMatcher`

Note, however, that for most use cases you should instead use the `eds.matcher` pipe, which wraps these classes to annotate documents.

## EDSPhraseMatcher

The `EDSPhraseMatcher` lets you efficiently match large terminology lists by comparing tokens against a given attribute.
This matcher differs from the `spacy.PhraseMatcher` in that it can skip pollution tokens. To make it efficient, we
have reimplemented the matching algorithm in Cython, like the original `spacy.PhraseMatcher`.

You can use it as described in the code below.

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.phrase import EDSPhraseMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = EDSPhraseMatcher(nlp.vocab, attr="NORM")
matcher.build_patterns(
    nlp,
    {
        "covid": ["corona virus", "coronavirus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
```
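
To make the idea of skipping pollution tokens more concrete, here is a minimal pure-Python sketch. It is not the library's Cython implementation: the `match_skipping` helper and the hand-written `excluded` flags are hypothetical, and the sketch only assumes that pollution tokens have already been flagged before matching.

```python
def match_skipping(tokens, excluded, phrase):
    """Find `phrase` in `tokens`, ignoring tokens flagged as excluded.

    Returns (start, end) token indices (end exclusive) in the ORIGINAL
    token sequence, so the returned span still covers the skipped tokens.
    """
    # Indices of the tokens that take part in the comparison.
    kept = [i for i, flag in enumerate(excluded) if not flag]
    kept_tokens = [tokens[i].lower() for i in kept]
    n = len(phrase)
    matches = []
    for k in range(len(kept_tokens) - n + 1):
        if kept_tokens[k : k + n] == phrase:
            matches.append((kept[k], kept[k + n - 1] + 1))
    return matches


# Hypothetical tokenization; the run of "=" is flagged as pollution by hand.
tokens = ["signe", "du", "Corona", "===============", "virus", "."]
excluded = [False, False, False, True, False, False]

spans = match_skipping(tokens, excluded, ["corona", "virus"])
start, end = spans[0]
text = " ".join(tokens[start:end])
```

The comparison runs on the filtered token sequence, while the reported span is mapped back to the original positions, which is why the matched text still contains the run of `=` characters.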

## RegexMatcher

The `RegexMatcher` performs full-text regex matching.
It is especially useful for handling spelling variations like `mammo-?graphies?`.
Like the `EDSPhraseMatcher`, this class can skip pollution tokens.
Note that this class is significantly slower than the `EDSPhraseMatcher`: if you can, try enumerating
lexical variations of the target phrases and feeding them to the `EDSPhraseMatcher` instead.

You can use it as described in the code below.

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.regex import RegexMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

matcher = RegexMatcher(attr="NORM", ignore_excluded=True)
matcher.build_patterns(
    {
        "covid": ["corona[ ]*virus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
```
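
The advice about enumerating lexical variations can be sketched with the standard library alone. The snippet below expands the `mammo-?graphies?` pattern mentioned above into its explicit variants and checks them against the regex. The `parts` decomposition is written by hand for this one pattern and is an assumption of the sketch, not something the library produces.

```python
import re
from itertools import product

pattern = re.compile(r"mammo-?graphies?")

# Each optional element of the regex becomes a list of alternatives.
parts = [["mammo"], ["", "-"], ["graphie"], ["", "s"]]
variants = ["".join(combo) for combo in product(*parts)]

# Sanity check: every enumerated variant is accepted by the original regex.
all_match = all(pattern.fullmatch(v) for v in variants)
```

The resulting `variants` list could then be passed as the terms of a phrase matcher in place of the regex.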

## SimstringMatcher

The `SimstringMatcher` performs fuzzy term matching by comparing spans of text with a
similarity metric. It is especially useful for handling spelling variations like
`paracetomol` (instead of `paracetamol`).

The [`simstring`](http://www.chokkan.org/software/simstring/) algorithm compares two strings by enumerating their character trigrams and
measuring the overlap between the two sets. In the previous example:

- `paracetomol` becomes `##p #pa par ara rac ace cet eto tom omo mol ol# l##`
- `paracetamol` becomes `##p #pa par ara rac ace cet eta tam amo mol ol# l##`

The two sets share 10 of their 13 trigrams each, so the Dice (or F1) similarity between them is 2 × 10 / (13 + 13) ≈ 0.77.
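
The trigram comparison above can be reproduced with a few lines of plain Python. This is a sketch of the similarity measure only, not of the `simstring` library itself: the helper names `char_trigrams` and `dice` are hypothetical, and the padding with two `#` characters on each side is taken from the trigram lists shown above.

```python
def char_trigrams(text, pad="#"):
    """Return the set of character trigrams of `text`, padded on both sides."""
    padded = pad * 2 + text + pad * 2
    return {padded[i : i + 3] for i in range(len(padded) - 2)}


def dice(a, b):
    """Dice coefficient between the trigram sets of two strings."""
    x, y = char_trigrams(a), char_trigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


shared = char_trigrams("paracetomol") & char_trigrams("paracetamol")
score = dice("paracetomol", "paracetamol")
print(len(shared), round(score, 2))
```

A candidate span is considered a match when its score reaches the chosen threshold, so a lower threshold tolerates more spelling variation at the cost of more false positives.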

Like the `EDSPhraseMatcher`, this class can skip pollution tokens.
Just like the `RegexMatcher`, this class is significantly slower than the
`EDSPhraseMatcher`: if you can, try enumerating lexical variations of the target phrases
and feeding them to the `EDSPhraseMatcher` instead.

You can use it as described in the code below.

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.simstring import SimstringMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp(
    "On ne relève pas de signe du corona-virus. Historique d'un hepatocellulaire carcinome."
)

matcher = SimstringMatcher(
    nlp.vocab,
    attr="NORM",
    ignore_excluded=True,
    measure="dice",
    threshold=0.75,
    windows=5,
)
matcher.build_patterns(
    nlp,
    {
        "covid": ["coronavirus", "covid"],
        "carcinome": ["carcinome hepatocellulaire"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: corona-virus

list(matcher(doc, as_spans=True))[1].text
# Out: hepatocellulaire carcinome
```