# Tutorial

`deduce` is a rule-based de-identification method for clinical text written in Dutch. It finds and removes information in one or more categories of interest (e.g. person names, names of institutions, locations). In principle, `deduce` works 'out of the box'. However, both scientific research and personal experience suggest that it is unlikely to remove all sensitive information unless some effort goes into customization. This tutorial should help you with that. Along with the basic steps to get started and highlights of some features, it describes how to tailor `deduce` to your specific data.

It's useful to note that from version `2.0.0`, `deduce` is built using `docdeid` ([docs](https://docdeid.readthedocs.io/en/latest/), [GitHub](https://github.com/vmenger/docdeid)), a small framework that helps build de-identifiers. Before you start customizing `deduce`, reading the `docdeid` docs will probably make things easier.

If you get stuck applying or modifying `deduce`, it's always possible to ask for help by creating an issue in our [issue tracker](https://github.com/vmenger/deduce/issues)!

```{include} ../../README.md
:start-after: <!-- start getting started -->
:end-before: <!-- end getting started -->
```

## Included components

A `docdeid` de-identifier is made up of document processors, such as annotators, annotation processors, and redactors, that are applied sequentially in a pipeline. The most important components that make up `deduce` are described below.

### Annotators

The `Annotator` is responsible for tagging pieces of information in the text as sensitive information that needs to be removed. `deduce` includes various annotators, described below:

| Group | Annotator Name | Annotator Type | Explanation |
|-------|----------------|----------------|-------------|
| names | prefix_with_initial | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by initial(s) |
| | prefix_with_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by an interfix and something that resembles a name |
| | prefix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by something that resembles a name |
| | interfix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an interfix followed by something that resembles a name |
| | initial_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by something that resembles a name |
| | initial_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by an interfix and something that resembles a name |
| | first_name_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on first names from Voornamenbank (Meertens Instituut) |
| | surname_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on surnames from Familienamenbank (Meertens Instituut) |
| | patient_name | `deduce.annotator.PatientNameAnnotator` | Custom logic to match patient name, if supplied in document metadata |
| | name_context | `deduce.annotator.ContextAnnotator` | Matches names based on annotations found above, with the following context patterns: `interfix_right`: an interfix and something that resembles a name, when preceded by a detected initial or name; `initial_left`: an initial, when followed by a detected initial, name or interfix; `naam_left`: something that resembles a name, when followed by a name; `naam_right`: something that resembles a name, when preceded by a name; `prefix_left`: a prefix, when followed by a prefix, initial, name or interfix |
| | eponymous_disease | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on eponymous diseases, which will be tagged with `pseudo_name` and removed later (along with any overlap) |
| locations | placename | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a compiled list of regions, provinces, municipalities and residences |
| | street_pattern | `docdeid.process.RegexpAnnotator` | Matches streetnames based on a pattern (ending in straat, plein, dam, etc.) |
| | street_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of streetnames from Basisadministratie Gemeenten |
| | housenumber | `deduce.annotator.ContextAnnotator` | Matches housenumbers and housenumberletters, based on the following context patterns: `housenumber_right`: a 1-4 digit number, preceded by a streetname; `housenumber_housenumberletter_right`: a 1-4 digit number and a single letter, preceded by a streetname; `housenumberletter_right`: a single letter, preceded by a housenumber |
| | postal_code | `docdeid.process.RegexpAnnotator` | Matches Dutch postal codes, i.e. four digits followed by two letters |
| | postbus | `docdeid.process.RegexpAnnotator` | Matches postbus, i.e. 'Postbus' followed by a 1-5 digit number, optionally with periods between them |
| institution | hospital | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of hospitals |
| | institution | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of healthcare institutions, based on Zorgkaart Nederland |
| dates | date_dmy_1 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01-01-2012 |
| | date_dmy_2 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01 jan 2012 |
| | date_ymd_1 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012-01-01 |
| | date_ymd_2 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012 jan 01 |
| ages | age | `deduce.annotator.RegexpPseudoAnnotator` | Matches ages based on a number of digit patterns followed by jaar/jaar oud. Excludes matches that are preceded/followed by one of the `pre_pseudo` / `post_pseudo` words, e.g. 'sinds 10 jaar' |
| identifiers | bsn | `deduce.annotator.BsnAnnotator` | Matches Dutch social security numbers (BSN), based on a 9-digit pattern that also passes the 'elfproef' |
| | identifier | `docdeid.process.RegexpAnnotator` | Matches any 7+ digit number as identifier |
| phone_numbers | phone | `deduce.annotator.PhoneNumberAnnotator` | Matches phone numbers, based on a regular expression pattern, optionally with one digit too few or one digit too many (common typos) |
| email_addresses | email | `docdeid.process.RegexpAnnotator` | Matches e-mail addresses, based on a regular expression pattern |
| urls | url | `docdeid.process.RegexpAnnotator` | Matches urls, based on a regular expression pattern |

It's possible to add, remove, or apply subsets of annotators, or to implement custom ones. These options are described further down under [customizing `deduce`](#customizing-deduce).
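
The `bsn` annotator in the table above relies on the 'elfproef', a checksum over the nine digits of a BSN. As an illustration only, the following is a minimal sketch of that check using the commonly documented BSN weights (9 down to 2 for the first eight digits, and -1 for the last). The function name `passes_elfproef` is hypothetical, and the implementation in `deduce.annotator.BsnAnnotator` may differ in its details.

```python
def passes_elfproef(number: str) -> bool:
    """Return True if a 9-digit string satisfies the BSN variant of the elfproef."""
    if len(number) != 9 or not number.isdecimal():
        return False
    # Weights 9..2 for the first eight digits, -1 for the last digit.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(int(digit) * weight for digit, weight in zip(number, weights))
    return total % 11 == 0


print(passes_elfproef("111222333"))  # True
print(passes_elfproef("123456789"))  # False
```

A check like this is what lets a 9-digit pattern be treated as a BSN rather than as an arbitrary number.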

### Other processors

In addition to annotators, a `docdeid` de-identifier contains annotation processors, which perform some operation on the set of annotations generated previously, and redactors, which take the annotations and replace them in the text. Other processors included in `deduce` are listed below:

| **Name** | **Group** | **Description** |
|----------|-----------|-----------------|
| person_annotation_converter | names | Maps name tags to either PERSON or PATIENT, and removes overlap with 'pseudo_name'. |
| remove_street_tags | locations | Removes any matched street names that are not followed by a housenumber. |
| clean_street_tags | locations | Cleans up street tags, e.g. straat+huisnummer -> locatie. |
| overlap_resolver | post_processing | Makes sure overlap among annotations is resolved. |
| merge_adjacent_annotations | post_processing | If there are any adjacent annotations with the same tag, they are merged into a single annotation. |
| redactor | post_processing | Takes care of replacing the annotated PHIs with `[TAG]` (e.g. `[LOCATION-1]`, `[DATE-2]`). |

### Lookup sets

In order to match tokens to known identifiable words or concepts, `deduce` has the following builtin lookup sets:

| **Name** | **Size** | **Examples** |
|----------|----------|--------------|
| prefix | 45 | bc., dhr., mijnheer |
| initial | 54 | Q, I, U |
| interfix | 44 | van de, von, v/d |
| first_name | 14690 | Martin, Alco, Wieke |
| interfix_surname | 2384 | Rijke, Butter, Agtmaal |
| surname | 10346 | Kosters, Hilderink, Kogelman |
| hospital | 9283 | Oude en Nieuwe Gasthuis, sint Jans zkh., Dijklander |
| hospital_abbr | 21 | UMCG, WKZ, PMC |
| healthcare_institution | 244342 | Gezondheidscentrum Wesselerbrink, Fysiotherapie Heer, Ergotherapie Tilburg-Waalwyk eo. |
| placename | 12049 | De Plaats, Diefdijk (U), Het Haantje (DR) |
| street | 769569 | Ds. Van Diemenstraat, Jac. v den Eyndestr, Matenstr |
| eponymous_disease | 22512 | tumor van Brucellosis, Lobomycosis reactie, syndroom van Alagille |
| common_word | 1008 | al, tuin, brengen |
| medical_term | 6939 | bevattingsvermogen, iliacaal, oor |
| stop_word | 101 | kan, heb, dat |

## Customizing deduce

We highly recommend making some effort to customize `deduce`, as even a basic effort will almost surely increase accuracy. Some ways to achieve this are outlined below, including: making changes to the config, adding/removing custom pipeline components, and modifying the builtin lookup sets.

### Adding a custom config

A default `base_config.json` ([source on GitHub](https://github.com/vmenger/deduce/blob/main/base_config.json)) file is packaged with `deduce`. Along with some basic settings, it defines all annotators (also listed above). Override settings by providing an additional user config to `Deduce`, either as a file or as a dict:

```python
from deduce import Deduce

# user config from a file
deduce = Deduce(config='my_own_config.json')

# or user config from a dict
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
```

This only overrides settings that are explicitly set in the user config; all other settings are kept as is. If you want to add or delete annotators, or change them (e.g. their regular expressions), it's easiest to make a copy of `base_config.json` and load it as follows:

```python
from deduce import Deduce

deduce = Deduce(load_base_config=False, config='my_own_config.json')
```

Note that you will now miss out on any updates to the base config that are packaged with new versions of `deduce`. For that reason, a better way to add/remove processors is to [interact with `Deduce.processors` directly](#implementing-custom-components) after creating the model.
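
To picture the override behaviour described in this section, the sketch below merges a user config into a base config so that only explicitly set keys change. It is an illustration with made-up settings and a hypothetical helper `merge_config`, assuming a recursive merge of nested dicts; it is not the merging code used inside `deduce`.

```python
from copy import deepcopy


def merge_config(base: dict, user: dict) -> dict:
    """Return a copy of `base` in which keys from `user` take precedence."""
    merged = deepcopy(base)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested settings are merged key by key rather than replaced wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Made-up base settings, for illustration only.
base_config = {
    "redactor_open_char": "[",
    "redactor_close_char": "]",
    "annotators": {"email": {"group": "email_addresses"}},
}
user_config = {"redactor_open_char": "**", "redactor_close_char": "**"}

config = merge_config(base_config, user_config)
print(config["redactor_open_char"])  # **
print(config["annotators"])  # {'email': {'group': 'email_addresses'}}
```

Settings that the user config does not mention, such as the `annotators` entry here, come through unchanged.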

### Using `disabled` keyword to disable components

It's possible to disable specific annotators or processors, or groups of them, when de-identifying a text. For example, to apply all annotators except those in the identifiers group:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", disabled={'identifiers'})
```

Or, to disable one specific date annotator in the dates group while keeping the other date patterns:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", disabled={'date_dmy_1'})
```

### Using `enabled` keyword

Although it's also possible to _enable_ only some processors, this is useful in only a limited number of cases. You must manually specify the groups, individual annotators, and postprocessors to get a sensible output. For example, to de-identify only e-mail addresses, use:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={
    'email_addresses',  # annotator group, with annotators:
    'email',
    'post_processing',  # post processing group, with processors:
    'overlap_resolver',
    'merge_adjacent_annotations',
    'redactor'
})
```

The following example, however, will apply **no annotators**, as the `email` annotator is enabled but its group `email_addresses` is not:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={'email'})
```
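
The interplay between groups and individual processors can be pictured with a small model of a nested pipeline. The sketch below is a simplified illustration, not the `docdeid` implementation: the `run` function and the toy pipeline are hypothetical, and it assumes that a name is skipped when `enabled` is given and does not contain it, or when `disabled` contains it, with the same check applied to a group before its members.

```python
from typing import Optional


def run(pipeline: dict, enabled: Optional[set] = None, disabled: Optional[set] = None) -> list:
    """Return the names of the processors that would be applied, in order."""
    applied = []
    for name, item in pipeline.items():
        if enabled is not None and name not in enabled:
            continue  # not explicitly enabled
        if disabled is not None and name in disabled:
            continue  # explicitly disabled
        if isinstance(item, dict):
            # A group: its members are only considered if the group itself passed.
            applied.extend(run(item, enabled, disabled))
        else:
            applied.append(name)
    return applied


# Toy pipeline with two groups; None stands in for a real processor object.
pipeline = {
    "email_addresses": {"email": None},
    "post_processing": {"overlap_resolver": None, "redactor": None},
}

print(run(pipeline, enabled={"email"}))  # []
print(run(pipeline, enabled={"email_addresses", "email", "post_processing", "redactor"}))
# ['email', 'redactor']
print(run(pipeline, disabled={"email_addresses"}))  # ['overlap_resolver', 'redactor']
```

Under these assumptions, enabling an annotator without its group has no effect, while disabling a group switches off everything inside it.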

### Implementing custom components

It's possible to implement the following custom components: `Annotator`, `AnnotationProcessor`, `Redactor` and `Tokenizer`. This is done by implementing the abstract classes defined in the `docdeid` package, as described here: [docdeid docs - docdeid components](https://docdeid.readthedocs.io/en/latest/tutorial.html#docdeid-components).

In our case, we can add or remove custom document processors by interacting with the `deduce.processors` attribute directly:

```python
from deduce import Deduce

deduce = Deduce()

# remove date annotators
del deduce.processors['dates']

# add another annotator (MyCustomAnnotator is your own Annotator implementation)
deduce.processors.add_processor(
    'some_new_category',
    MyCustomAnnotator(),
    position=0
)
```

Note that by default, processors are applied in the order they are added to the pipeline. To prevent a new annotator from being added after the `post_processing` group (meaning its annotations would not be redacted in the text), use the `position` keyword of the `add_processor` method, as in the example above.
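
Why the position matters can be seen in a toy model of an ordered pipeline. In the sketch below, plain strings stand in for processors and the `add_processor` helper is hypothetical (it only mimics the idea of the `position` keyword); it is not the `docdeid` API.

```python
from typing import Optional


def add_processor(pipeline: list, name: str, position: Optional[int] = None) -> None:
    """Insert `name` at `position`, or append it when no position is given."""
    if position is None:
        pipeline.append(name)
    else:
        pipeline.insert(position, name)


appended = ["names", "dates", "post_processing"]
add_processor(appended, "some_new_category")
print(appended)  # ['names', 'dates', 'post_processing', 'some_new_category']

inserted = ["names", "dates", "post_processing"]
add_processor(inserted, "some_new_category", position=0)
print(inserted)  # ['some_new_category', 'names', 'dates', 'post_processing']
```

In the first case the new entry runs after `post_processing`, so nothing would redact its annotations; in the second it runs before.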

#### Changing tokenizer

There might be a case where you want to add a custom annotator to `deduce` that requires its own tokenizing logic. Replacing the builtin tokenizer is not recommended, as builtin annotators depend on it, but it's possible to add more tokenizers as follows:

```python
from deduce import Deduce

deduce = Deduce()

# make sure MyCustomTokenizer implements the abstract docdeid.tokenize.Tokenizer
deduce.tokenizers['my_custom_tokenizer'] = MyCustomTokenizer()
```

Then annotators can use:

```python
import docdeid as dd

def annotate(doc: dd.Document):
    tokens = doc.get_tokens("my_custom_tokenizer")
```

### Tailoring lookup structures

Updating the builtin lookup sets and tries is a very useful and straightforward way to tailor `deduce`. Changes can be made directly through the `Deduce.lookup_structs` attribute, like so:

```python
from deduce import Deduce

deduce = Deduce()

# sets
deduce.lookup_structs['first_names'].add_items_from_iterable(["naam", "andere_naam"])
deduce.lookup_structs['whitelist'].add_items_from_iterable(["woord", "ander_woord"])

# tries
deduce.lookup_structs['residences'].add_items(["kleine", "plaats", "in", "de", "regio"])
deduce.lookup_structs['institutions'].add_items_from_iterable(["verzorgingstehuis", "hier", "om", "de", "hoek"])
```
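
Tries differ from sets in that each item is a sequence of tokens rather than a single string, which is what allows multi-token names to be matched. The following toy trie illustrates the idea; the class `TokenTrie` and its methods are hypothetical and much simpler than the structures in `docdeid`.

```python
class TokenTrie:
    """Toy trie whose items are sequences of tokens."""

    def __init__(self) -> None:
        self.children: dict = {}
        self.is_terminal = False

    def add_item(self, tokens: list) -> None:
        """Store one token sequence."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_matching_prefix(self, tokens: list) -> list:
        """Return the longest stored sequence that starts `tokens`, or an empty list."""
        node, longest = self, []
        for i, token in enumerate(tokens):
            node = node.children.get(token)
            if node is None:
                break
            if node.is_terminal:
                longest = tokens[: i + 1]
        return longest


trie = TokenTrie()
trie.add_item(["Den", "Haag"])
trie.add_item(["Den", "Helder"])

print(trie.longest_matching_prefix(["Den", "Haag", "centrum"]))  # ['Den', 'Haag']
print(trie.longest_matching_prefix(["Den", "Bosch"]))  # []
```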

Full documentation on sets and tries, and how to modify them, is available in the [docdeid API](https://docdeid.readthedocs.io/en/latest/api/docdeid.ds.html#docdeid.ds.lookup.LookupSet).

Larger changes may also be made by copying the source files, modifying the copies, and pointing `deduce` to the directory with the modified sources:

```python
from deduce import Deduce

deduce = Deduce(lookup_data_path="/my/path")
```

It's important to copy the directory, or your changes will be overwritten with the next `deduce` update. Currently, there is no additional documentation available on how to structure and transform the lookup items in the directory, other than inspecting the pre-packaged files. Also remember that updates to lookup values in future releases of `deduce` will not be applied if `deduce` loads items from a copy, so differences need to be tracked manually with each release.