edsnlp / Git / Diff of /docs/utilities/matchers.md

Models:

philipB/

edsnlp

Downloads: 1

Diff of /docs/utilities/matchers.md [000000] .. [cad161]

Switch to unified view

 b/docs/utilities/matchers.md
+# Matchers
+We implemented three pattern matchers that are fit to clinical documents:
+- the `EDSPhraseMatcher`
+- the `RegexMatcher`
+- the `SimstringMatcher`
+However, note that for most use-cases, you should instead use the `eds.matcher` pipe that wraps these classes to annotate documents.
+## EDSPhraseMatcher
+The EDSPhraseMatcher lets you efficiently match large terminology lists, by comparing tokenx against a given attribute.
+This matcher differs from the `spacy.PhraseMatcher` in that it allows to skip pollution tokens. To make it efficient, we
+have reimplemented the matching algorithm in Cython, like the original `spacy.PhraseMatcher`.
+You can use it as described in the code below.
+```python
+import edsnlp, edsnlp.pipes as eds
+from edsnlp.matchers.phrase import EDSPhraseMatcher
+nlp = edsnlp.blank("eds")
+nlp.add_pipe(eds.normalizer())
+doc = nlp("On ne relève pas de signe du Corona =============== virus.")
+matcher = EDSPhraseMatcher(nlp.vocab, attr="NORM")
+matcher.build_patterns(
+    nlp,
+    {
+        "covid": ["corona virus", "coronavirus", "covid"],
+        "diabete": ["diabete", "diabetique"],
+    },
+)
+list(matcher(doc, as_spans=True))[0].text
+# Out: Corona =============== virus
+```
+## RegexMatcher
+The `RegexMatcher` performs full-text regex matching.
+It is especially useful to handle spelling variations like `mammo-?graphies?`.
+Like the `EDSPhraseMatcher`, this class allows to skip pollution tokens.
+Note that this class is significantly slower than the `EDSPhraseMatcher`: if you can, try enumerating
+lexical variations of the target phrases and feed them to the `PhraseMatcher` instead.
+You can use it as described in the code below.
+```python
+import edsnlp, edsnlp.pipes as eds
+from edsnlp.matchers.regex import RegexMatcher
+nlp = edsnlp.blank("eds")
+nlp.add_pipe(eds.normalizer())
+doc = nlp("On ne relève pas de signe du Corona =============== virus.")
+matcher = RegexMatcher(attr="NORM", ignore_excluded=True)
+matcher.build_patterns(
+    {
+        "covid": ["corona[ ]*virus", "covid"],
+        "diabete": ["diabete", "diabetique"],
+    },
+)
+list(matcher(doc, as_spans=True))[0].text
+# Out: Corona =============== virus
+```
+## SimstringMatcher
+The `SimstringMatcher` performs fuzzy term matching by comparing spans of text with a
+similarity metric. It is especially useful to handle spelling variations like
+`paracetomol` (instead of `paracetamol`).
+The [`simstring`](www.chokkan.org/software/simstring/) algorithm compares two strings by enumerating their char trigrams and
+measuring the overlap between the two sets. In the previous example:
+- `paracetomol` becomes `##p #pa par ara rac ace cet eto tom omo mol ol# l##`
+- `paracetamol` becomes `##p #pa par ara rac ace cet eta tam amo mol ol# l##`
+and the Dice (or F1) similarity between the two sets is 0.75.
+Like the `EDSPhraseMatcher`, this class allows to skip pollution tokens.
+Just like the `RegexMatcher`, this class is significantly slower than the
+`EDSPhraseMatcher`: if you can, try enumerating lexical variations of the target phrases
+and feed them to the `PhraseMatcher` instead.
+You can use it as described in the code below.
+```python
+import edsnlp, edsnlp.pipes as eds
+from edsnlp.matchers.simstring import SimstringMatcher
+nlp = edsnlp.blank("eds")
+nlp.add_pipe(eds.normalizer())
+doc = nlp(
+    "On ne relève pas de signe du corona-virus. Historique d'un hepatocellulaire carcinome."
+)
+matcher = SimstringMatcher(
+    nlp.vocab,
+    attr="NORM",
+    ignore_excluded=True,
+    measure="dice",
+    threshold=0.75,
+    windows=5,
+)
+matcher.build_patterns(
+    nlp,
+    {
+        "covid": ["coronavirus", "covid"],
+        "carcinome": ["carcinome hepatocellulaire"],
+    },
+)
+list(matcher(doc, as_spans=True))[0].text
+# Out: corona-virus
+list(matcher(doc, as_spans=True))[1].text
+# Out: hepatocellulaire carcinome
+```