# Migrating to version `3.0.0`

Version `3.0.0` of `deduce` includes many optimizations that allow more accurate de-identification, some of which were already included in versions `2.1.0` through `2.5.0`. It also includes some structural optimizations. Version `3.0.0` should be backwards compatible, but some functionality is scheduled for removal in `3.1.0`. Those changes are listed below.

## Custom config

A custom config can now be passed either as a `dict` or as the name of a `json` file. In both cases, pass it to `Deduce` using the `config` keyword, e.g.:

```python
from deduce import Deduce

deduce = Deduce(config='my_own_config.json')  # path to a json file
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})  # or a dict
```

The `config_file` keyword is no longer used; please use `config` instead.
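
Since the same settings can be supplied either way, a `dict` config can be written to a `json` file and reused later. The sketch below uses only the standard library; the file name and the two redactor keys are taken from the example above, the temporary directory is just for illustration, and the final `Deduce` calls are left as comments so the snippet runs without `deduce` installed.

```python
import json
import tempfile
from pathlib import Path

# Settings taken from the example above.
config = {"redactor_open_char": "**", "redactor_close_char": "**"}

# Write the dict to a json file (a temporary directory is used for illustration).
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "my_own_config.json"
config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")

# Reading the file back gives the same settings as the dict.
loaded = json.loads(config_path.read_text(encoding="utf-8"))

# Either form can then be passed with the `config` keyword:
# deduce = Deduce(config=config)
# deduce = Deduce(config=str(config_path))
```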

## Lookup structure names

For consistency, lookup structure names are now all in singular form:

| **Old name**            | **New name**           |
|-------------------------|------------------------|
| prefixes                | prefix                 |
| first_names             | first_name             |
| interfixes              | interfix               |
| interfix_surnames       | interfix_surname       |
| surnames                | surname                |
| streets                 | street                 |
| placenames              | placename              |
| hospitals               | hospital               |
| healthcare_institutions | healthcare_institution |

Additionally, the `first_name_exceptions` and `surname_exceptions` lists have been removed. The exception items are now removed from the original lists in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
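
If your own code or custom config refers to lookup structures by their old names, the renames can be applied mechanically. The following is a minimal sketch and not part of `deduce`: `RENAMED_LOOKUPS`, `REMOVED_LOOKUPS` and `rename_lookup` are hypothetical names, `my_custom_list` is a made-up example, and the `interfixes` entry assumes the singular `interfix`, following the stated convention.

```python
# Old -> new lookup structure names, following the table above.
RENAMED_LOOKUPS = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",  # assumption: singular, per the stated convention
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}

# Lists that no longer exist in 3.0.0.
REMOVED_LOOKUPS = {"first_name_exceptions", "surname_exceptions"}


def rename_lookup(name: str) -> str:
    """Return the 3.0.0 name for a lookup structure (hypothetical helper)."""
    if name in REMOVED_LOOKUPS:
        raise KeyError(f"lookup structure {name!r} was removed in 3.0.0")
    # Names that are not in the table are returned unchanged.
    return RENAMED_LOOKUPS.get(name, name)


old_names = ["first_names", "surnames", "my_custom_list"]
new_names = [rename_lookup(name) for name in old_names]
print(new_names)  # ['first_name', 'surname', 'my_custom_list']
```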

## The `annotator_type` field in config

In a config, each annotator should specify an `annotator_type`, so that `Deduce` knows which annotator to load. In `3.0.0` this has been simplified. In most cases, the `annotator_type` field should be set to the `module.Class` of the annotator that should be loaded, and `Deduce` will handle the rest (sometimes with a little bit of magic, so that all arguments are presented with the right type). You should make the following changes:

| **Old `annotator_type`** | **Change** |
|--------------------------|------------|
| multi_token              | `docdeid.process.MultiTokenLookupAnnotator` |
| dd_token_pattern         | This used to load `docdeid.process.TokenPatternAnnotator`, which is now replaced by `deduce.annotator.TokenPatternAnnotator`. The latter is more powerful, but needs a different pattern. A `docdeid.process.TokenPatternAnnotator` can no longer be loaded through config, although adding it manually to `Deduce.processors` is always possible. |
| token_pattern            | `deduce.annotator.TokenPatternAnnotator` |
| annotation_context       | `deduce.annotator.ContextAnnotator` |
| custom                   | Use `module.Class` directly as the `annotator_type`. The `module` and `class` fields that used to be specified in `args` should be removed from there. |
| regexp                   | `docdeid.process.RegexpAnnotator` |
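
As a rough illustration, the changes in the table can be applied to a config `dict` before passing it to `Deduce`. This is a sketch under assumptions: the helper `migrate_annotator_types` is hypothetical, the annotator names and `args` values in the example are made up, and the config is assumed to keep its annotators under an `annotators` key, each with an `annotator_type` and optional `args`. Entries of type `dd_token_pattern` are only reported, since they need a rewritten pattern.

```python
import copy

# Old annotator_type -> new `module.Class` value, following the table above.
NEW_ANNOTATOR_TYPES = {
    "multi_token": "docdeid.process.MultiTokenLookupAnnotator",
    "token_pattern": "deduce.annotator.TokenPatternAnnotator",
    "annotation_context": "deduce.annotator.ContextAnnotator",
    "regexp": "docdeid.process.RegexpAnnotator",
}


def migrate_annotator_types(config: dict) -> tuple[dict, list[str]]:
    """Return a migrated copy of `config` and the names needing manual work."""
    migrated = copy.deepcopy(config)
    manual = []

    # Assumption: annotators live under an "annotators" key.
    for name, annotator in migrated.get("annotators", {}).items():
        old_type = annotator.get("annotator_type")

        if old_type in NEW_ANNOTATOR_TYPES:
            annotator["annotator_type"] = NEW_ANNOTATOR_TYPES[old_type]
        elif old_type == "custom":
            # `module` and `class` move out of `args` into `annotator_type`.
            args = annotator.get("args", {})
            annotator["annotator_type"] = f"{args.pop('module')}.{args.pop('class')}"
        elif old_type == "dd_token_pattern":
            # Needs a new pattern, so it cannot be converted automatically.
            manual.append(name)

    return migrated, manual


old_config = {
    "annotators": {
        "phone": {"annotator_type": "regexp", "args": {"tag": "telefoonnummer"}},
        "mine": {
            "annotator_type": "custom",
            "args": {"module": "my_package.annotators", "class": "MyAnnotator"},
        },
        "legacy": {"annotator_type": "dd_token_pattern", "args": {}},
    }
}

new_config, needs_manual_work = migrate_annotator_types(old_config)
print(new_config["annotators"]["mine"]["annotator_type"])
# my_package.annotators.MyAnnotator
print(needs_manual_work)
# ['legacy']
```

The helper works on a deep copy, so the original config stays available for comparison while you check the migrated one.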

# Migrating to version `2.0.0`

Version `2.0.0` of `deduce` sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply `deduce` to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.

## Calling `deduce`

`deduce` is now called through `Deduce.deidentify`, which replaces the `annotate_text` and `deidentify_annotations` functions. Those functions give a `DeprecationWarning` from version `2.0.0` onwards, and will no longer be supported from version `2.1.0`.

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(text)
```

</td>
</tr>
</table>
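
Because the old functions emit a `DeprecationWarning`, remaining calls can be tracked down by turning that warning into an error, for example during a test run. The sketch below uses only the standard `warnings` module; `legacy_annotate_text` is a hypothetical stand-in for a deprecated function, so the snippet runs without `deduce` installed.

```python
import warnings


def legacy_annotate_text(text: str) -> str:
    """Hypothetical stand-in for a deprecated function."""
    warnings.warn("annotate_text is deprecated", DeprecationWarning, stacklevel=2)
    return text


found_deprecated_call = False

with warnings.catch_warnings():
    # Escalate deprecation warnings to exceptions inside this block only.
    warnings.simplefilter("error", DeprecationWarning)
    try:
        legacy_annotate_text("Jan Jansen")
    except DeprecationWarning:
        found_deprecated_call = True

print(found_deprecated_call)  # True
```

The same effect is available for a whole run with `python -W error::DeprecationWarning`.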

## Accessing output

The annotations and deidentified text are now available in the `Document` object. Intext annotations can still be useful for comparisons; they can be obtained by passing the document to a util function from the `docdeid` library (note that the format has changed).

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
print(annotated_text)
# '<PERSOON Jan Jansen>'

print(deidentified_text)
# '<PERSOON-1>'
```

</td>
<td>

```python
import docdeid as dd

print(dd.utils.annotate_intext(doc))
# '<PERSOON>Jan Jansen</PERSOON>'

print(doc.annotations)
# AnnotationSet({
#     Annotation(
#         text="Jan Jansen",
#         start_char=0,
#         end_char=10,
#         tag="persoon",
#         length="10"
#     )
# })

print(doc.deidentified_text)
# '<PERSOON-1>'
```

</td>
</tr>
</table>

## Adding patient names

The `patient_first_names`, `patient_initials`, `patient_surname` and `patient_given_name` keywords of `annotate_text` are replaced by the `Person` class, which offers a structured way to enter this information. An instance of this class can be passed to `deidentify()` as metadata. The use of a given name is deprecated; it can instead be added as a separate first name. The behaviour is otherwise the same.

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce
from deduce.person import Person

text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)

deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
```

</td>
</tr>
</table>
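
If patient details are stored in the old keyword format, they can be converted before constructing a `Person`. The helper below is a hypothetical sketch: it assumes `patient_first_names` is a whitespace-separated string, as in the example above, and it returns plain keyword arguments so that it runs without `deduce` installed.

```python
from typing import Optional


def person_kwargs(
    patient_first_names: Optional[str] = None,
    patient_initials: Optional[str] = None,
    patient_surname: Optional[str] = None,
    patient_given_name: Optional[str] = None,
) -> dict:
    """Map the old `annotate_text` keywords to `Person` keyword arguments."""
    # Assumption: multiple first names are separated by whitespace.
    first_names = patient_first_names.split() if patient_first_names else []

    # A given name is now just another first name.
    if patient_given_name and patient_given_name not in first_names:
        first_names.append(patient_given_name)

    return {
        "first_names": first_names or None,
        "initials": patient_initials,
        "surname": patient_surname,
    }


kwargs = person_kwargs(
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop",
)
print(kwargs["first_names"])  # ['Jan', 'Hendrik', 'Joop']

# With deduce installed: patient = Person(**kwargs)
```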

## Enabling/disabling specific categories

Previously, the `annotate_text` function allowed disabling specific categories through the `dates`, `ages`, `names`, etc. keywords. The same behaviour can now be achieved by setting the `disabled` argument of the `Deduce.deidentify` method. Note that the identification logic of Deduce is now further split up into `Annotator` classes, which allows enabling/disabling specific components. The tutorial describes the [specific annotators and other components](tutorial.md#annotators), and explains how to [enable, disable, replace or modify specific components](tutorial.md#customizing-deduce).

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
```

</td>
</tr>
</table>
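
When porting calls that used the old boolean keywords, the `disabled` set can be derived from them. This is a minimal sketch, assuming the old keyword names map one-to-one onto the names accepted by `disabled`, as in the example above; `disabled_from_flags` is a hypothetical helper and not part of `deduce`.

```python
def disabled_from_flags(**flags: bool) -> set:
    """Collect the names of all categories that were switched off."""
    return {name for name, enabled in flags.items() if not enabled}


disabled = disabled_from_flags(dates=False, ages=False, names=True)
print(sorted(disabled))  # ['ages', 'dates']

# With deduce installed: doc = deduce.deidentify(text, disabled=disabled)
```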