# Qualifier Overview

In EDS-NLP, we call _qualifiers_ the suite of components designed to _qualify_ a
pre-extracted entity for a linguistic modality.

## Available components

<!-- --8<-- [start:components] -->

| Pipeline              | Description                          |
|-----------------------|--------------------------------------|
| `eds.negation`        | Rule-based negation detection        |
| `eds.family`          | Rule-based family context detection  |
| `eds.hypothesis`      | Rule-based speculation detection     |
| `eds.reported_speech` | Rule-based reported speech detection |
| `eds.history`         | Rule-based medical history detection |

<!-- --8<-- [end:components] -->

## Rationale

In a typical medical NLP pipeline, a group of clinicians defines a list of synonyms for a given concept of interest (diabetes, for example) and looks for that terminology in a corpus of documents.

Now, consider the following example:

=== "French"

    ```
    Le patient n'est pas diabétique.
    Le patient est peut-être diabétique.
    Le père du patient est diabétique.
    ```

=== "English"

    ```
    The patient is not diabetic.
    The patient could be diabetic.
    The patient's father is diabetic.
    ```

There is an obvious problem: none of these examples should lead us to include this particular patient in the cohort.

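
To make the problem concrete, here is a minimal sketch of a naive terminology lookup applied to the three French sentences above. It is plain Python, not EDS-NLP code, and the synonym list and the `mentions_concept` helper are hypothetical names introduced for illustration only.

```python
# Illustrative only: a naive terminology lookup, not EDS-NLP code.
sentences = [
    "Le patient n'est pas diabétique.",
    "Le patient est peut-être diabétique.",
    "Le père du patient est diabétique.",
]

# Hypothetical synonym list for the concept of interest.
synonyms = ["diabétique", "diabète"]


def mentions_concept(sentence: str) -> bool:
    """Return True if any synonym appears in the sentence."""
    lowered = sentence.lower()
    return any(term in lowered for term in synonyms)


matches = [mentions_concept(sentence) for sentence in sentences]
print(matches)  # [True, True, True]
```

All three sentences match, although none of them states that the patient is diabetic. The lookup sees the term but not the negation, speculation or family context around it.
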
!!! warning

    We show an English example just to explain the issue.
    EDS-NLP remains a **French-language** medical NLP library.

To address this issue, EDS-NLP provides rule-based pipes that qualify entities, helping the user make an informed decision about which patients should be included in a real-world data cohort.

## Where do we get our spans? {: #edsnlp.pipes.base.SpanGetterArg }

A component gets entities from a document by looking up `doc.ents` or `doc.spans[group]`. This behavior is set by the `span_getter` argument in components that support it.

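
The lookup described above can be pictured with a small stand-alone sketch. It uses a plain dictionary in place of a spaCy `Doc`, and it assumes a `span_getter` shaped as a mapping from group name to either `True` (keep every span) or a list of labels. The `collect_spans` helper and that mapping shape are assumptions made for illustration. The reference documentation below remains the authority on accepted values.

```python
from typing import Dict, List, Union

# Stand-in for a spaCy Doc: `ents` plays the role of `doc.ents`,
# `spans` the role of `doc.spans`. Spans are (text, label) tuples here.
doc = {
    "ents": [("diabétique", "diabetes")],
    "spans": {
        "symptoms": [("fièvre", "fever"), ("toux", "cough")],
    },
}

# Assumed shape: group name -> True (all spans) or a list of labels.
SpanGetter = Dict[str, Union[bool, List[str]]]


def collect_spans(doc: dict, span_getter: SpanGetter) -> list:
    """Gather spans from `ents` and/or named span groups (hypothetical helper)."""
    collected = []
    for group, labels in span_getter.items():
        # The special key "ents" reads doc.ents; other keys read doc.spans.
        candidates = doc["ents"] if group == "ents" else doc["spans"].get(group, [])
        for span in candidates:
            keep_all = labels is True
            keep_label = isinstance(labels, list) and span[1] in labels
            if keep_all or keep_label:
                collected.append(span)
    return collected


only_ents = collect_spans(doc, {"ents": True})
filtered = collect_spans(doc, {"ents": True, "symptoms": ["cough"]})
```

Here `only_ents` holds the single entity from `ents`, while `filtered` additionally holds the `cough` span from the `symptoms` group.
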
::: edsnlp.pipes.base.SpanGetterArg
    options:
        heading_level: 2
        show_bases: false
        show_source: false
        only_class_level: true

## Under the hood

Our _qualifier_ pipes all follow the same basic pattern:

1. The pipeline extracts cues. We define three (possibly overlapping) kinds:

    - `preceding`, i.e. cues that _precede_ modulated entities;
    - `following`, i.e. cues that _follow_ modulated entities;
    - in some cases, `verbs`, i.e. verbs that convey a modulation (treated as preceding cues).

2. The pipeline splits the text into sentences and propositions, using annotations from a sentencizer pipeline and `termination` patterns, which define syntagma/proposition terminations.

3. For each pre-extracted entity, the pipeline checks whether there is a preceding cue between the start of the syntagma and the start of the entity, or a following cue between the end of the entity and the end of the proposition.

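
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration working on a pre-tokenized sentence, not the EDS-NLP implementation: the cue lists, the `split_propositions` and `is_qualified` helpers, and the whitespace tokenization are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical cue lists for a negation-like modality (not EDS-NLP's own).
PRECEDING = {"pas", "sans"}
FOLLOWING = {"exclu", "absent"}
TERMINATION = {"mais", "et", ","}


def split_propositions(tokens: List[str]) -> List[Tuple[int, int]]:
    """Split one sentence into [start, end) proposition boundaries."""
    bounds, start = [], 0
    for i, token in enumerate(tokens):
        if token in TERMINATION:
            bounds.append((start, i))
            start = i + 1
    bounds.append((start, len(tokens)))
    return bounds


def is_qualified(tokens: List[str], ent_start: int, ent_end: int) -> bool:
    """Look for cues around the entity tokens[ent_start:ent_end]."""
    for start, end in split_propositions(tokens):
        if start <= ent_start and ent_end <= end:
            before = tokens[start:ent_start]  # proposition start -> entity start
            after = tokens[ent_end:end]       # entity end -> proposition end
            return any(t in PRECEDING for t in before) or any(
                t in FOLLOWING for t in after
            )
    return False


tokens = "le patient n' est pas diabétique mais il est hypertendu".split()
negated_diabetes = is_qualified(tokens, 5, 6)       # "diabétique"
negated_hypertension = is_qualified(tokens, 9, 10)  # "hypertendu"
```

In this sketch, the termination word `mais` closes the first proposition, so the cue `pas` qualifies `diabétique` but does not reach `hypertendu`.
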
Although simple, this algorithm can perform very well, depending on the modality. For instance, our `eds.negation` pipeline reaches an F1-score of 88% on our dataset.

!!! note "Dealing with pseudo-cues"

    The pipeline can also detect **pseudo-cues**, i.e. phrases that contain cues but **are not cues themselves**. For instance, `sans doute`/`without doubt` contains `sans`/`without`, but does not convey negation.

    Detecting pseudo-cues lets the pipeline filter out any cue that overlaps with a pseudo-cue.

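
As a rough illustration of the pseudo-cue filtering described in the note above, the following sketch matches cues and pseudo-cues with regular expressions and discards any cue whose character span overlaps a pseudo-cue. The helper names (`find_all`, `overlaps`, `filter_cues`) and the regex-based matching are assumptions for the example. EDS-NLP's own matching may work differently.

```python
import re
from typing import List, Tuple

Match = Tuple[int, int]  # (start, end) character offsets


def find_all(pattern: str, text: str) -> List[Match]:
    """Return the character spans of every whole-word match."""
    return [m.span() for m in re.finditer(rf"\b{re.escape(pattern)}\b", text)]


def overlaps(a: Match, b: Match) -> bool:
    """Two half-open spans overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def filter_cues(text: str, cues: List[str], pseudo_cues: List[str]) -> List[Match]:
    """Keep only the cue matches that do not overlap any pseudo-cue match."""
    pseudo = [span for p in pseudo_cues for span in find_all(p, text)]
    found = [span for c in cues for span in find_all(c, text)]
    return [span for span in found if not any(overlaps(span, p) for p in pseudo)]


text = "sans doute diabétique, sans complication"
kept = filter_cues(text, cues=["sans"], pseudo_cues=["sans doute"])
```

Only the second `sans` survives: the first one lies inside the pseudo-cue `sans doute` and is dropped.
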
!!! warning "Sentence boundaries are required"

    The rule-based algorithm detects cues and propagates their modulation to the rest of the [syntagma](https://en.wikipedia.org/wiki/Syntagma_(linguistics)){target=_blank}. For that reason, a qualifier pipeline needs a sentencizer component to be defined, and will fail otherwise.

    You may use EDS-NLP's own sentencizer, `eds.sentences`:

    ```{ .python .no-check }
    import edsnlp, edsnlp.pipes as eds

    ...
    # Add the sentencizer before any qualifier pipe
    nlp.add_pipe(eds.sentences())
    ```

## Persisting the results

Our qualifier pipelines write their results to a custom [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes){target=_blank}, defined on both `Span` and `Token` objects. We follow the convention of naming this attribute after the pipeline itself, e.g. `Span._.negation` for the `eds.negation` pipeline.

We also provide a string representation of the result, computed on the fly by declaring a getter that reads the boolean result of the pipeline. Following spaCy convention, we give this attribute the same name, followed by a `_`.
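
The relationship between the boolean attribute and its string counterpart can be sketched with a plain Python property. The `QualifiedSpan` class below is a hypothetical stand-in for a spaCy `Span`, and the `"NEG"`/`"AFF"` labels are assumptions chosen for illustration. In the library itself, both values live under the `._.` extension namespace.

```python
class QualifiedSpan:
    """Stand-in for a spaCy Span carrying a qualifier result."""

    def __init__(self, text: str, negation: bool):
        self.text = text
        self.negation = negation  # boolean result written by the pipe

    @property
    def negation_(self) -> str:
        # Computed on the fly from the boolean; the labels are illustrative.
        return "NEG" if self.negation else "AFF"


span = QualifiedSpan("diabétique", negation=True)
print(span.negation, span.negation_)  # True NEG
```

Because the string is derived from the boolean each time it is read, the two attributes cannot drift apart.
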