# Tutorial

`deduce` is a rule-based de-identification method for clinical text written in Dutch. It finds and removes information in one or more categories of interest (e.g. person names, names of institutions, locations). In principle, `deduce` can work 'out of the box'. However, based on both scientific research and personal experience, `deduce` is unlikely to remove all sensitive information unless some effort goes into customization. This tutorial should help you with that: along with basic steps to get started and highlights of some features, it describes how to tailor `deduce` to your specific data.

It's useful to note that from version `2.0.0`, `deduce` is built using `docdeid` ([docs](https://docdeid.readthedocs.io/en/latest/), [GitHub](https://github.com/vmenger/docdeid)), a small framework that helps build de-identifiers. Checking the `docdeid` docs before you start customizing `deduce` will probably make things easier.

In case you get stuck applying or modifying `deduce`, it's always possible to ask for help by creating an issue in our [issue tracker](https://github.com/vmenger/deduce/issues)!

```{include} ../../README.md
:start-after: <!-- start getting started -->
:end-before: <!-- end getting started -->
```

## Included components

A `docdeid` de-identifier is made up of document processors, such as annotators, annotation processors, and redactors, that are applied sequentially in a pipeline. The most important components that make up `deduce` are described below.

### Annotators

The `Annotator` is responsible for tagging pieces of information in the text as sensitive information that needs to be removed. `deduce` includes various annotators, described below:

| Group | Annotator Name | Annotator Type | Explanation |
|-----------------|----------------------|---------------------------------------------|-------------|
| names | prefix_with_initial | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by initial(s) |
| | prefix_with_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by an interfix and something that resembles a name |
| | prefix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by something that resembles a name |
| | interfix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an interfix followed by something that resembles a name |
| | initial_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by something that resembles a name |
| | initial_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by an interfix and something that resembles a name |
| | first_name_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on first names from Voornamenbank (Meertens Instituut) |
| | surname_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on surnames from Familienamenbank (Meertens Instituut) |
| | patient_name | `deduce.annotator.PatientNameAnnotator` | Custom logic to match patient name, if supplied in document metadata |
| | name_context | `deduce.annotator.ContextAnnotator` | Matches names based on annotations found above, with the following context patterns: `interfix_right`: an interfix and something that resembles a name, when preceded by a detected initial or name; `initial_left`: an initial, when followed by a detected initial, name or interfix; `naam_left`: something that resembles a name, when followed by a name; `naam_right`: something that resembles a name, when preceded by a name; `prefix_left`: a prefix, when followed by a prefix, initial, name or interfix |
| | eponymous_disease | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on eponymous diseases, which will be tagged with `pseudo_name` and removed later (along with any overlap) |
| locations | placename | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a compiled list of regions, provinces, municipalities and residences |
| | street_pattern | `docdeid.process.RegexpAnnotator` | Matches street names based on a pattern (ending in straat, plein, dam, etc.) |
| | street_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of street names from Basisadministratie Gemeenten |
| | housenumber | `deduce.annotator.ContextAnnotator` | Matches house numbers and house number letters, based on the following context patterns: `housenumber_right`: a 1-4 digit number, preceded by a street name; `housenumber_housenumberletter_right`: a 1-4 digit number and a single letter, preceded by a street name; `housenumberletter_right`: a single letter, preceded by a house number |
| | postal_code | `docdeid.process.RegexpAnnotator` | Matches Dutch postal codes, i.e. four digits followed by two letters |
| | postbus | `docdeid.process.RegexpAnnotator` | Matches postbus, i.e. 'Postbus' followed by a 1-5 digit number, optionally with periods between the digits |
| institution | hospital | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of hospitals |
| | institution | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of healthcare institutions, based on Zorgkaart Nederland |
| dates | date_dmy_1 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01-01-2012 |
| | date_dmy_2 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01 jan 2012 |
| | date_ymd_1 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012-01-01 |
| | date_ymd_2 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012 jan 01 |
| ages | age | `deduce.annotator.RegexpPseudoAnnotator` | Matches ages based on a number of digit patterns followed by jaar/jaar oud. Excludes matches that are preceded/followed by one of the `pre_pseudo` / `post_pseudo` words, e.g. 'sinds 10 jaar' |
| identifiers | bsn | `deduce.annotator.BsnAnnotator` | Matches Dutch social security numbers (BSN), based on a 9-digit pattern that also passes the 'elfproef' |
| | identifier | `docdeid.process.RegexpAnnotator` | Matches any 7+ digit number as identifier |
| phone_numbers | phone | `deduce.annotator.PhoneNumberAnnotator` | Matches phone numbers, based on a regular expression pattern, optionally with one digit too few or one digit too many (common typos) |
| email_addresses | email | `docdeid.process.RegexpAnnotator` | Matches e-mail addresses, based on a regular expression pattern |
| urls | url | `docdeid.process.RegexpAnnotator` | Matches urls, based on a regular expression pattern |

It's possible to add or remove annotators, apply subsets of them, or implement custom annotators. Those options are described further down under [customizing `deduce`](#customizing-deduce).

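
The 'elfproef' mentioned for the `bsn` annotator is a checksum over the nine digits. The sketch below shows the commonly documented BSN variant of that check in plain Python. The function name `passes_elfproef` is hypothetical, and the code is not taken from `deduce`'s own `BsnAnnotator`, which may differ in details.

```python
import re


def passes_elfproef(bsn: str) -> bool:
    """Return True if a 9-digit string passes the BSN variant of the elfproef."""
    if re.fullmatch(r"[0-9]{9}", bsn) is None:
        return False
    # Weights 9 down to 2 for the first eight digits, and -1 for the last digit.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(int(digit) * weight for digit, weight in zip(bsn, weights))
    # The weighted sum must be divisible by 11.
    return total % 11 == 0


print(passes_elfproef("111222333"))  # True: the weighted sum is 66
print(passes_elfproef("123456789"))  # False: the weighted sum is 147
```

A check like this keeps arbitrary 9-digit numbers, such as measurements or internal codes, from being tagged as a BSN.
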
### Other processors

In addition to annotators, a `docdeid` de-identifier contains annotation processors, which perform an operation on the set of annotations generated previously, and redactors, which take the annotations and replace them in the text. Other processors included in `deduce` are listed below:

| **Name** | **Group** | **Description** |
|-----------------------------|-----------------|-----------------|
| person_annotation_converter | names | Maps name tags to either PERSON or PATIENT, and removes overlap with 'pseudo_name' |
| remove_street_tags | locations | Removes any matched street names that are not followed by a house number |
| clean_street_tags | locations | Cleans up street tags, e.g. straat+huisnummer -> locatie |
| overlap_resolver | post_processing | Makes sure overlap among annotations is resolved |
| merge_adjacent_annotations | post_processing | If there are any adjacent annotations with the same tag, they are merged into a single annotation |
| redactor | post_processing | Takes care of replacing the annotated PHIs with `[TAG]` (e.g. `[LOCATION-1]`, `[DATE-2]`) |

### Lookup sets

In order to match tokens to known identifiable words or concepts, `deduce` has the following builtin lookup sets:

| **Name** | **Size** | **Examples** |
|------------------------|----------|--------------|
| prefix | 45 | bc., dhr., mijnheer |
| initial | 54 | Q, I, U |
| interfix | 44 | van de, von, v/d |
| first_name | 14690 | Martin, Alco, Wieke |
| interfix_surname | 2384 | Rijke, Butter, Agtmaal |
| surname | 10346 | Kosters, Hilderink, Kogelman |
| hospital | 9283 | Oude en Nieuwe Gasthuis, sint Jans zkh., Dijklander |
| hospital_abbr | 21 | UMCG, WKZ, PMC |
| healthcare_institution | 244342 | Gezondheidscentrum Wesselerbrink, Fysiotherapie Heer, Ergotherapie Tilburg-Waalwyk eo. |
| placename | 12049 | De Plaats, Diefdijk (U), Het Haantje (DR) |
| street | 769569 | Ds. Van Diemenstraat, Jac. v den Eyndestr, Matenstr |
| eponymous_disease | 22512 | tumor van Brucellosis, Lobomycosis reactie, syndroom van Alagille |
| common_word | 1008 | al, tuin, brengen |
| medical_term | 6939 | bevattingsvermogen, iliacaal, oor |
| stop_word | 101 | kan, heb, dat |

## Customizing deduce

We highly recommend making some effort to customize `deduce`, as even some basic effort will almost surely increase accuracy. Some ways to achieve this are outlined below: making changes to the config, adding or removing custom pipeline components, and modifying the builtin lookup sets.

### Adding a custom config

A default `base_config.json` ([source on GitHub](https://github.com/vmenger/deduce/blob/main/base_config.json)) file is packaged with `deduce`. Along with some basic settings, it defines all annotators (also listed above). Override settings by providing an additional user config to `Deduce`, either as a file or as a dict:

```python
from deduce import Deduce

# user config as a file
deduce = Deduce(config='my_own_config.json')

# user config as a dict
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
```

This only overrides settings that are explicitly set in the user config; all other settings are kept as is. If you want to add, delete or modify annotators (e.g. change regular expressions), it's easiest to make a copy of `base_config.json` and load it as follows:

```python
from deduce import Deduce

deduce = Deduce(load_base_config=False, config='my_own_config.json')
```

Note that you will now miss out on any updates to the base config that are packaged with new versions of Deduce. For that reason, a better way to add/remove processors is to [interact with `Deduce.processors` directly](#implementing-custom-components) after creating the model.
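
Returning to the file-based user config shown at the start of this section, such a file can also be generated from Python. The sketch below writes the two redactor settings from the dict example to `my_own_config.json`. It assumes the user config file is plain JSON with the same keys as the dict form; this tutorial does not spell that out, so check it against `base_config.json`.

```python
import json
from pathlib import Path

# Only the settings to override; both keys come from the dict example earlier
# in this section.
user_config = {
    "redactor_open_char": "**",
    "redactor_close_char": "**",
}

config_path = Path("my_own_config.json")
config_path.write_text(json.dumps(user_config, indent=4), encoding="utf-8")

# The file can then be passed to Deduce (requires deduce to be installed):
# from deduce import Deduce
# deduce = Deduce(config=str(config_path))
```
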

### Using `disabled` keyword to disable components

It's possible to disable specific (groups of) annotators or processors when de-identifying a text. For example, to apply all annotators, except those in the identifiers group:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", disabled={'identifiers'})
```

Or, to disable one specific annotator in the dates group while keeping the other date patterns:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", disabled={'date_dmy_1'})
```

### Using `enabled` keyword

Although it's also possible to _enable_ only some processors, this is only useful in a limited number of cases. You must manually specify the groups, individual annotators, and postprocessors to get sensible output. For example, to de-identify only e-mail addresses, use:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={
    'email_addresses', # annotator group, with annotators:
    'email',
    'post_processing', # post processing group, with processors:
    'overlap_resolver',
    'merge_adjacent_annotations',
    'redactor'
})
```

The following example, however, will apply **no annotators**, as the `email` annotator is enabled, but its group `email_addresses` is not:

```python
from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={'email'})
```
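
The behaviour of `enabled` is easier to see with a toy model. The sketch below is not `docdeid`'s implementation. It assumes a simplified pipeline of named groups, and the names `PIPELINE` and `applied_processors` are hypothetical. In this model a processor only runs when both its own name and the name of its group are in `enabled`.

```python
# Toy pipeline: group name -> processor names (a small subset of the tables above).
PIPELINE = {
    "email_addresses": ["email"],
    "urls": ["url"],
    "post_processing": ["overlap_resolver", "merge_adjacent_annotations", "redactor"],
}


def applied_processors(enabled: set[str]) -> list[str]:
    """Return the processors that would run in this toy model."""
    applied = []
    for group, processors in PIPELINE.items():
        if group not in enabled:
            continue  # a group that is not enabled skips all of its processors
        applied.extend(name for name in processors if name in enabled)
    return applied


print(applied_processors({"email"}))
# []
print(applied_processors({"email_addresses", "email", "post_processing", "redactor"}))
# ['email', 'redactor']
```

In this model, enabling only `email` gives an empty result because the enclosing group is skipped before its annotators are considered.
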

### Implementing custom components

It's possible to implement the following custom components: `Annotator`, `AnnotationProcessor`, `Redactor` and `Tokenizer`. This is done by implementing the abstract classes defined in the `docdeid` package, as described in the [docdeid docs - docdeid components](https://docdeid.readthedocs.io/en/latest/tutorial.html#docdeid-components).

In our case, we can add or remove custom document processors by interacting with the `deduce.processors` attribute directly:

```python
from deduce import Deduce

deduce = Deduce()

# remove date annotators
del deduce.processors['dates']

# add another annotator; MyCustomAnnotator is a placeholder for your own
# implementation of the docdeid Annotator class
deduce.processors.add_processor(
    'some_new_category',
    MyCustomAnnotator(),
    position=0
)
```

Note that by default, processors are applied in the order they are added to the pipeline. To prevent a new annotator being added after the `post_processing` group (meaning the annotations would not be redacted in the text), use the `position` keyword of the `add_processor` method, as in the example above.
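
Why the position matters can be shown with a toy pipeline. The sketch below does not use `docdeid`; `run_pipeline` and the three processor functions are hypothetical stand-ins. Each processor receives a shared state, and the redactor only sees annotations that were created before it runs.

```python
def date_annotator(state: dict) -> None:
    state["annotations"].append("DATE")


def custom_annotator(state: dict) -> None:
    state["annotations"].append("CUSTOM")


def redactor(state: dict) -> None:
    # Redacts whatever has been annotated so far.
    state["redacted"] = list(state["annotations"])


def run_pipeline(processors: list) -> list[str]:
    state = {"annotations": [], "redacted": []}
    for processor in processors:  # applied in insertion order
        processor(state)
    return state["redacted"]


pipeline = [date_annotator, redactor]

appended = pipeline + [custom_annotator]   # added at the end, after the redactor
prepended = [custom_annotator] + pipeline  # comparable to position=0

print(run_pipeline(appended))   # ['DATE']
print(run_pipeline(prepended))  # ['CUSTOM', 'DATE']
```
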

#### Changing tokenizer

There might be a case where you want to add a custom annotator to `deduce` that requires its own tokenizing logic. Replacing the builtin tokenizer is not recommended, as builtin annotators depend on it, but it's possible to add more tokenizers as follows:

```python
from deduce import Deduce

deduce = Deduce()

# make sure MyCustomTokenizer implements the abstract docdeid.tokenize.Tokenizer
deduce.tokenizers['my_custom_tokenizer'] = MyCustomTokenizer()
```

Then annotators can use:

```python
import docdeid as dd

def annotate(doc: dd.Document):
    tokens = doc.get_tokens("my_custom_tokenizer")
```

### Tailoring lookup structures

Updating the builtin lookup sets and tries is a very useful and straightforward way to tailor `deduce`. Changes can be made directly through the `Deduce.lookup_structs` attribute, as follows:

```python
from deduce import Deduce

deduce = Deduce()

# sets: each item is a single string
deduce.lookup_structs['first_names'].add_items_from_iterable(["naam", "andere_naam"])
deduce.lookup_structs['whitelist'].add_items_from_iterable(["woord", "ander_woord"])

# tries: each item is a list of tokens
deduce.lookup_structs['residences'].add_item(["kleine", "plaats", "in", "de", "regio"])
deduce.lookup_structs['institutions'].add_item(["verzorgingstehuis", "hier", "om", "de", "hoek"])
```

Full documentation on sets and tries, and how to modify them, is available in the [docdeid API](https://docdeid.readthedocs.io/en/latest/api/docdeid.ds.html#docdeid.ds.lookup.LookupSet).
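
To make the difference between the two structures more concrete: a set holds each item as a single string, while a trie holds an item as a sequence of tokens, so multi-token phrases can be matched token by token. The sketch below is a plain-Python illustration with a hypothetical `TokenTrie` class; it is not the `docdeid` implementation.

```python
class TokenTrie:
    """Minimal trie keyed on tokens rather than characters."""

    def __init__(self) -> None:
        self.children: dict[str, "TokenTrie"] = {}
        self.is_terminal = False

    def add_item(self, tokens: list[str]) -> None:
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens: list[str], start: int = 0) -> list[str]:
        """Return the longest stored item that begins at tokens[start]."""
        node, best = self, []
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i])
            if node is None:
                break
            if node.is_terminal:
                best = tokens[start : i + 1]
        return best


trie = TokenTrie()
trie.add_item(["kleine", "plaats"])
trie.add_item(["kleine", "plaats", "in", "de", "regio"])

text_tokens = ["woont", "in", "kleine", "plaats", "in", "de", "buurt"]
print(trie.longest_match(text_tokens, start=2))  # ['kleine', 'plaats']
```
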

Larger changes can be made by copying the lookup source files, modifying the copies directly, and pointing `deduce` to the directory with the modified sources:

```python
from deduce import Deduce

deduce = Deduce(lookup_data_path="/my/path")
```

It's important to copy the directory, or your changes will be overwritten with the next `deduce` update. Currently, there is no additional documentation available on how to structure and transform the lookup items in the directory, other than inspecting the pre-packaged files. Also remember that any updates to lookup values in future releases of `deduce` will not be applied if `deduce` loads items from a copy; differences need to be tracked manually with each release.