# Migrating to version `3.0.0`

Version `3.0.0` of `deduce` includes many optimizations that allow more accurate de-identification, some of which were already included in versions `2.1.0` through `2.5.0`. It also includes some structural optimizations. Version `3.0.0` should be backwards compatible, but some functionality is scheduled for removal in `3.1.0`. Those changes are listed below.

## Custom config

A custom config can now be passed either as a `dict` or as the filename of a JSON file. In both cases, pass it to `Deduce` with the `config` keyword, e.g.:

```python
from deduce import Deduce

deduce = Deduce(config='my_own_config.json')
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
```

The `config_file` keyword is no longer used; please use `config` instead.
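
Call sites that still pass `config_file` can be updated by hand, or the rename can be applied to a dictionary of keyword arguments before it reaches `Deduce`. The helper below is a minimal sketch of that idea. Note that `migrate_deduce_kwargs` is a hypothetical name and not part of `deduce`, and the sketch assumes no other keyword needs to change.

```python
def migrate_deduce_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with the old `config_file` key renamed to `config`.

    Hypothetical helper for illustration; it is not part of deduce.
    """
    migrated = dict(kwargs)  # leave the caller's dict untouched
    if "config_file" in migrated:
        if "config" in migrated:
            raise ValueError("Pass either `config_file` or `config`, not both.")
        migrated["config"] = migrated.pop("config_file")
    return migrated


old_kwargs = {"config_file": "my_own_config.json"}
new_kwargs = migrate_deduce_kwargs(old_kwargs)
print(new_kwargs)
```

The resulting dictionary can then be unpacked into the constructor, e.g. `Deduce(**new_kwargs)`.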

## Lookup structure names

For consistency, lookup structure names are now all in singular form:

| **Old name**            | **New name**           |
|-------------------------|------------------------|
| prefixes                | prefix                 |
| first_names             | first_name             |
| interfixes              | interfix               |
| interfix_surnames       | interfix_surname       |
| surnames                | surname                |
| streets                 | street                 |
| placenames              | placename              |
| hospitals               | hospital               |
| healthcare_institutions | healthcare_institution |

Additionally, the `first_name_exceptions` and `surname_exceptions` lists are removed. The exception items are now simply removed from the original list in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
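
If your own code or config refers to lookup structures by name, the renames can be applied mechanically. The snippet below is a minimal sketch: `LOOKUP_RENAMES` and `rename_lookup_keys` are hypothetical names, not part of `deduce`. The mapping assumes the singular form `interfix` for `interfixes`, in line with the other entries.

```python
# Old (plural) lookup structure names mapped to their new (singular) names.
# `interfix` is assumed to be the singular form used for `interfixes`.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}


def rename_lookup_keys(structures: dict) -> dict:
    """Return a copy of `structures` with old names replaced by new ones.

    Names that are not in LOOKUP_RENAMES are kept unchanged.
    """
    return {LOOKUP_RENAMES.get(name, name): value for name, value in structures.items()}


old_structures = {"first_names": ["Jan"], "streets": ["Dorpsstraat"], "whitelist": []}
new_structures = rename_lookup_keys(old_structures)
print(sorted(new_structures))
```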

## The `annotator_type` field in config

In a config, each annotator should specify `annotator_type`, so that `Deduce` knows which annotator to load. In `3.0.0` we simplified this a bit. In most cases, the `annotator_type` field should be set to the `module.Class` of the annotator that should be loaded, and `Deduce` will handle the rest (sometimes with a little bit of magic, so that all arguments are presented with the right type). You should make the following changes:

| **Old `annotator_type`** | **Change** |
|--------------------------|------------|
| multi_token              | `docdeid.process.MultiTokenLookupAnnotator` |
| dd_token_pattern         | This used to load `docdeid.process.TokenPatternAnnotator`, but this is now replaced by `deduce.annotator.TokenPatternAnnotator`. The latter is more powerful, but needs a different pattern. A `docdeid.process.TokenPatternAnnotator` can no longer be loaded through config, although adding it manually to `Deduce.processors` is always possible. |
| token_pattern            | `deduce.annotator.TokenPatternAnnotator` |
| annotation_context       | `deduce.annotator.ContextAnnotator` |
| custom                   | Use `module.Class` directly. The `module` and `class` fields that used to be specified in `args` should be removed from `args`. |
| regexp                   | `docdeid.process.RegexpAnnotator` |
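
The changes in the table can be applied to an existing config programmatically. The function below is a sketch under stated assumptions: each annotator entry is a `dict` with an `annotator_type` key and an optional `args` dict, and `migrate_annotator` is a hypothetical helper that is not part of `deduce`. A `dd_token_pattern` entry is rejected rather than converted, because its pattern has to be rewritten by hand.

```python
import copy

# Old annotator_type values that map one-to-one onto a `module.Class` string.
DIRECT_RENAMES = {
    "multi_token": "docdeid.process.MultiTokenLookupAnnotator",
    "token_pattern": "deduce.annotator.TokenPatternAnnotator",
    "annotation_context": "deduce.annotator.ContextAnnotator",
    "regexp": "docdeid.process.RegexpAnnotator",
}


def migrate_annotator(annotator: dict) -> dict:
    """Return a copy of one annotator entry with a 3.0.0-style annotator_type.

    Assumes the entry has an `annotator_type` key and an optional `args` dict.
    """
    migrated = copy.deepcopy(annotator)
    old_type = migrated["annotator_type"]

    if old_type in DIRECT_RENAMES:
        migrated["annotator_type"] = DIRECT_RENAMES[old_type]
    elif old_type == "custom":
        # `module` and `class` move out of `args` and into annotator_type.
        args = migrated.get("args", {})
        migrated["annotator_type"] = f"{args.pop('module')}.{args.pop('class')}"
    elif old_type == "dd_token_pattern":
        # The replacement annotator needs a different pattern, so this
        # entry cannot be converted automatically.
        raise ValueError("dd_token_pattern needs a manually rewritten pattern")
    # Any other value (e.g. an existing `module.Class`) is left unchanged.
    return migrated


old_entry = {
    "annotator_type": "custom",
    "args": {"module": "my_package.annotators", "class": "MyAnnotator", "threshold": 3},
}
new_entry = migrate_annotator(old_entry)
print(new_entry)
```

Here `my_package.annotators`, `MyAnnotator` and `threshold` are placeholder values chosen for the example.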

# Migrating to version `2.0.0`

Version `2.0.0` of `deduce` sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply `deduce` to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.

## Calling `deduce`

`deduce` is now called from `Deduce.deidentify`, which replaces the `annotate_text` and `deidentify_annotations` functions. Those functions give a `DeprecationWarning` from version `2.0.0`, and are scheduled for removal in version `2.1.0`.

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(text)
```

</td>
</tr>
</table>

## Accessing output

The annotations and deidentified text are now available in the `Document` object. In-text annotations can still be useful for comparisons; they can be obtained by passing the document to a utility function from the `docdeid` library (note that the format has changed).

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
print(annotated_text)
'<PERSOON Jan Jansen>'

print(deidentified_text)
'<PERSOON-1>'
```

</td>
<td>

```python
import docdeid as dd

print(dd.utils.annotate_intext(doc))
'<PERSOON>Jan Jansen</PERSOON>'

print(doc.annotations)
AnnotationSet({
    Annotation(
        text="Jan Jansen",
        start_char=0,
        end_char=10,
        tag="persoon",
        length="10"
    )
})

print(doc.deidentified_text)
'<PERSOON-1>'
```

</td>
</tr>
</table>

## Adding patient names

The `patient_first_names`, `patient_initials`, `patient_surname` and `patient_given_name` keywords of `annotate_text` are replaced with a structured way to enter this information: the `Person` class. An instance of this class can be passed to `deidentify()` as metadata. The use of a given name is deprecated; it can instead be added as a separate first name. The behaviour remains the same.

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce
from deduce.person import Person

text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)

deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
```

</td>
</tr>
</table>
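
When patient details are stored in the old keyword format, the conversion to `Person` arguments can be automated. The function below is a minimal sketch: `person_kwargs_from_legacy` is a hypothetical helper, not part of `deduce`, and it assumes that multiple first names are separated by whitespace, as in `"Jan Hendrik"` in the example above.

```python
def person_kwargs_from_legacy(
    patient_first_names=None,
    patient_initials=None,
    patient_surname=None,
    patient_given_name=None,
):
    """Translate the old annotate_text keywords into keyword arguments for Person.

    Hypothetical helper; assumes first names are separated by whitespace.
    """
    first_names = patient_first_names.split() if patient_first_names else []
    if patient_given_name:
        # A given name is now simply an additional first name.
        first_names.append(patient_given_name)
    return {
        "first_names": first_names or None,
        "initials": patient_initials,
        "surname": patient_surname,
    }


kwargs = person_kwargs_from_legacy(
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop",
)
print(kwargs)
```

The returned dictionary uses the same argument names as the `Person` call shown in the table, so it can be unpacked as `Person(**kwargs)`.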

## Enabling/disabling specific categories

Previously, the `annotate_text` function allowed disabling specific categories with the `dates`, `ages`, `names`, etc. keywords. The same behaviour can now be achieved by setting the `disabled` argument of the `Deduce.deidentify` method. Note that the identification logic of `deduce` is now further split up into `Annotator` classes, which allows enabling or disabling specific components. You can read more about the specific annotators and other components in the [annotators section of the tutorial](tutorial.md#annotators), and about enabling, disabling, replacing or modifying specific components in the [section on customizing `deduce`](tutorial.md#customizing-deduce).

<table>
<tr>
<th align="center" width="50%">deprecated</th>
<th align="center" width="50%">new</th>
</tr>
<tr>
<td>

```python
from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)
```

</td>
<td>

```python
from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
```

</td>
</tr>
</table>
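
If existing code builds the old boolean flags dynamically, they can be turned into a `disabled` set with a small helper. This is a sketch only: `disabled_from_legacy_flags` is a hypothetical name, not part of `deduce`. It assumes each old keyword has the same name as the entry expected in `disabled`, as is the case for `dates` and `ages` in the example above. Other names should be checked against the tutorial.

```python
def disabled_from_legacy_flags(**flags: bool) -> set:
    """Collect the categories whose old-style flag was set to False.

    Hypothetical helper; assumes old keyword names can be reused in `disabled`.
    """
    return {name for name, enabled in flags.items() if not enabled}


disabled = disabled_from_legacy_flags(dates=False, ages=False, names=True)
print(sorted(disabled))
```

The resulting set can be passed on as `deduce.deidentify(text, disabled=disabled)`.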