# Matchers

We implemented three pattern matchers that are suited to clinical documents:

- the `EDSPhraseMatcher`
- the `RegexMatcher`
- the `SimstringMatcher`

Note, however, that for most use cases you should instead use the `eds.matcher` pipe, which wraps these classes to annotate documents.

## EDSPhraseMatcher

The `EDSPhraseMatcher` lets you efficiently match large terminology lists by comparing tokens against a given attribute.
This matcher differs from the `spacy.PhraseMatcher` in that it can skip pollution tokens. To keep it efficient, we
have reimplemented the matching algorithm in Cython, like the original `spacy.PhraseMatcher`.

The following example shows how to use it:

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.phrase import EDSPhraseMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

# Compare tokens on their normalized form (NORM attribute)
matcher = EDSPhraseMatcher(nlp.vocab, attr="NORM")
# Map each label to the list of terms that should trigger it
matcher.build_patterns(
    nlp,
    {
        "covid": ["corona virus", "coronavirus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
```
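To make the idea of skipping pollution tokens more concrete, here is a minimal pure-Python sketch. It is not the Cython implementation used by `EDSPhraseMatcher`: the `Token` class, its `excluded` flag and the `find_phrase` function are hypothetical names introduced for illustration, and the sentence is tokenized by hand.

```python
from dataclasses import dataclass


@dataclass
class Token:
    norm: str  # normalized form compared against the pattern
    excluded: bool = False  # True for pollution tokens (assumed flag)


def find_phrase(tokens, pattern):
    """Return (start, end) token spans matching `pattern`, skipping excluded tokens."""
    if not pattern:
        return []
    matches = []
    for start, first in enumerate(tokens):
        if first.excluded:
            continue  # a match never starts on a pollution token
        matched, end = 0, start
        for i in range(start, len(tokens)):
            tok = tokens[i]
            if tok.excluded:
                continue  # skip pollution without consuming a pattern word
            if tok.norm != pattern[matched]:
                break
            matched += 1
            end = i + 1
            if matched == len(pattern):
                matches.append((start, end))
                break
    return matches


tokens = [
    Token("signe"), Token("du"), Token("corona"),
    Token("===============", excluded=True), Token("virus"), Token("."),
]
spans = find_phrase(tokens, ["corona", "virus"])
print(spans)  # [(2, 5)]
print(" ".join(t.norm for t in tokens[2:5]))  # corona =============== virus
```

The returned span still covers the pollution token, which mirrors the `Corona =============== virus` output of the library example: the pollution is ignored for comparison but remains part of the matched text.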

## RegexMatcher

The `RegexMatcher` performs full-text regex matching.
It is especially useful for handling spelling variations like `mammo-?graphies?`.
Like the `EDSPhraseMatcher`, this class can skip pollution tokens.
Note that this class is significantly slower than the `EDSPhraseMatcher`: if you can, try enumerating
lexical variations of the target phrases and feeding them to the `EDSPhraseMatcher` instead.

The following example shows how to use it:

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.regex import RegexMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp("On ne relève pas de signe du Corona =============== virus.")

# Match on the normalized text and skip excluded (pollution) tokens
matcher = RegexMatcher(attr="NORM", ignore_excluded=True)
# Map each label to a list of regular expressions
matcher.build_patterns(
    {
        "covid": ["corona[ ]*virus", "covid"],
        "diabete": ["diabete", "diabetique"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: Corona =============== virus
```
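The advice to enumerate lexical variations instead of relying on a regex can be sketched with the standard library alone. The snippet below uses Python's `re` module directly (not the `RegexMatcher`) to show which spellings `mammo-?graphies?` accepts; the variable names are illustrative. Because the pattern describes a small, finite set of spellings, they can be listed explicitly and passed as plain terms to a phrase matcher.

```python
import re
from itertools import product

pattern = re.compile(r"mammo-?graphies?")

# Enumerate the finite set of spellings described by the pattern:
# an optional hyphen, then "graphie" with an optional plural "s".
variants = [
    f"mammo{hyphen}graphie{plural}"
    for hyphen, plural in product(["", "-"], ["", "s"])
]
print(variants)
# ['mammographie', 'mammographies', 'mammo-graphie', 'mammo-graphies']

# Every enumerated variant is fully matched by the regex ...
assert all(pattern.fullmatch(v) for v in variants)
# ... while a different word is not.
assert pattern.fullmatch("mammogramme") is None
```

This only works when the regex describes a bounded set of strings; patterns with open-ended repetition such as `corona[ ]*virus` cannot be enumerated exhaustively this way.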

## SimstringMatcher

The `SimstringMatcher` performs fuzzy term matching by comparing spans of text with a
similarity metric. It is especially useful for handling spelling variations like
`paracetomol` (instead of `paracetamol`).

The [`simstring`](http://www.chokkan.org/software/simstring/) algorithm compares two strings by enumerating their character trigrams and
measuring the overlap between the two sets. In the previous example:

- `paracetomol` becomes `##p #pa par ara rac ace cet eto tom omo mol ol# l##`
- `paracetamol` becomes `##p #pa par ara rac ace cet eta tam amo mol ol# l##`

The two sets contain 13 trigrams each and share 10 of them, so the Dice (or F1) similarity between them is 2 × 10 / (13 + 13) ≈ 0.77.
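The trigram comparison can be reproduced in a few lines of plain Python. This is a sketch of the idea only, not the `simstring` library itself: the helper names `char_trigrams` and `dice` are hypothetical, and the padding with two `#` characters on each side is an assumption taken from the trigram lists above. Under these assumptions, the two sets share 10 of their 13 trigrams each, which gives a score of 20/26, or about 0.77.

```python
def char_trigrams(text, pad="#"):
    """Return the set of character trigrams of `text`, padded on both sides."""
    padded = pad * 2 + text + pad * 2
    return {padded[i : i + 3] for i in range(len(padded) - 2)}


def dice(a, b):
    """Dice coefficient between two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


typo = char_trigrams("paracetomol")
term = char_trigrams("paracetamol")
shared = typo & term
score = dice(typo, term)

print(len(typo), len(term), len(shared))  # 13 13 10
print(sorted(typo - term))  # ['eto', 'omo', 'tom']
print(round(score, 2))  # 0.77
```

A single substituted character changes three trigrams, so longer terms tolerate a typo better than short ones at a given threshold.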

Like the `EDSPhraseMatcher`, this class can skip pollution tokens.
Just like the `RegexMatcher`, this class is significantly slower than the
`EDSPhraseMatcher`: if you can, try enumerating lexical variations of the target phrases
and feeding them to the `EDSPhraseMatcher` instead.

The following example shows how to use it:

```python
import edsnlp, edsnlp.pipes as eds
from edsnlp.matchers.simstring import SimstringMatcher

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
doc = nlp(
    "On ne relève pas de signe du corona-virus. Historique d'un hepatocellulaire carcinome."
)

# Fuzzy matching with the Dice measure and a 0.75 similarity threshold
matcher = SimstringMatcher(
    nlp.vocab,
    attr="NORM",
    ignore_excluded=True,
    measure="dice",
    threshold=0.75,
    windows=5,
)
# Map each label to the list of terms that should trigger it
matcher.build_patterns(
    nlp,
    {
        "covid": ["coronavirus", "covid"],
        "carcinome": ["carcinome hepatocellulaire"],
    },
)

list(matcher(doc, as_spans=True))[0].text
# Out: corona-virus

list(matcher(doc, as_spans=True))[1].text
# Out: hepatocellulaire carcinome
```