Diff of /docs/source/migrating.md [000000] .. [79668b]

 b/docs/source/migrating.md
+# Migrating to version `3.0.0`
+Version `3.0.0` of `deduce` includes many optimizations that allow more accurate de-identification, some of which were already included in versions `2.1.0` through `2.5.0`. It also includes some structural optimizations. Version `3.0.0` should be backwards compatible, but some functionality is scheduled for removal in `3.1.0`. Those changes are listed below.
+## Custom config
+A custom config can now be supplied either as a `dict` or as the filename of a JSON file. Both are passed to `Deduce` with the `config` keyword, e.g.:
+```python
+deduce = Deduce(config='my_own_config.json')
+deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
+```
+The `config_file` keyword is no longer used; please use `config` instead.
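For code that still passes the old keyword, a small shim can ease the transition. The sketch below is an illustration only: `migrate_deduce_kwargs` is a hypothetical helper, not part of `deduce`, and all it does is rename the keyword before the arguments are handed to `Deduce`.

```python
def migrate_deduce_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with `config_file` renamed to `config`.

    Hypothetical helper for illustration; it is not part of deduce.
    """
    migrated = dict(kwargs)
    if "config_file" in migrated:
        if "config" in migrated:
            raise ValueError("Pass either `config` or `config_file`, not both.")
        migrated["config"] = migrated.pop("config_file")
    return migrated


old_kwargs = {"config_file": "my_own_config.json"}
new_kwargs = migrate_deduce_kwargs(old_kwargs)
print(new_kwargs)  # {'config': 'my_own_config.json'}
# deduce = Deduce(**new_kwargs)
```

The input dictionary is copied rather than modified, so existing call sites can be migrated one at a time.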
+## Lookup structure names
+For consistency, lookup structure names are now all in singular form:
+| **Old name**            | **New name**           |
+|-------------------------|------------------------|
+| prefixes                | prefix                 |
+| first_names             | first_name             |
+| interfixes              | interfix               |
+| interfix_surnames       | interfix_surname       |
+| surnames                | surname                |
+| streets                 | street                 |
+| placenames              | placename              |
+| hospitals               | hospital               |
+| healthcare_institutions | healthcare_institution |
+Additionally, the `first_name_exceptions` and `surname_exceptions` lists have been removed. The exception items are now simply removed from the original lists in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
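If custom patterns or configs refer to lookup structures by name, the renames can be applied mechanically. The following is a minimal sketch, not part of `deduce`: `LOOKUP_RENAMES`, `REMOVED_LOOKUPS` and `migrate_lookup_name` are hypothetical names, and the new name `interfix` is an assumption that follows the singular-form rule.

```python
# Old plural names mapped to the new singular names.
# "interfix" is an assumption here, following the singular-form rule.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}

# Lists that no longer exist in 3.0.0.
REMOVED_LOOKUPS = {"first_name_exceptions", "surname_exceptions"}


def migrate_lookup_name(name: str) -> str:
    """Translate an old lookup structure name to its 3.0.0 equivalent."""
    if name in REMOVED_LOOKUPS:
        raise KeyError(f"{name!r} was removed in 3.0.0; drop the explicit exception filter")
    # Names that are not in the table (e.g. already singular) pass through unchanged.
    return LOOKUP_RENAMES.get(name, name)


print(migrate_lookup_name("first_names"))  # first_name
print(migrate_lookup_name("surname"))      # surname
```

Raising on the removed exception lists, instead of silently dropping them, makes it easier to find patterns that still filter exceptions by hand.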
+## The `annotator_type` field in config
+In a config, each annotator should specify `annotator_type`, so `Deduce` knows which annotator to load. In `3.0.0` we simplified this a bit. In most cases, the `annotator_type` field should be set to the `module.Class` of the annotator that should be loaded, and `Deduce` will handle the rest (sometimes with a little bit of magic, so all arguments are presented with the right type). You should make the following changes:
+| **annotator_type**   | **Change**                                                                                                                                                                                                                                                                                                                                           |
+|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| multi_token          | `docdeid.process.MultiTokenLookupAnnotator`                                                                                                                                                                                                                                                                                                          |
+| dd_token_pattern     | This used to load `docdeid.process.TokenPatternAnnotator`, but this is now replaced by `deduce.annotator.TokenPatternAnnotator`. The latter is more powerful but needs a different pattern. A `docdeid.process.TokenPatternAnnotator` can no longer be loaded through config, although adding it manually to `Deduce.processors` is always possible. |
+| token_pattern        | `deduce.annotator.TokenPatternAnnotator`                                                                                                                                                                                                                                                                                                             |
+| annotation_context   | `deduce.annotator.ContextAnnotator`                                                                                                                                                                                                                                                                                                                  |
+| custom               | Use `module.Class` directly, where `module` and `class` fields used to be specified in `args`. They should be removed there.                                                                                                                                                                                                                         |
+| regexp               | `docdeid.process.RegexpAnnotator`                                                                                                                                                                                                                                                                                                                    |
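The table can be applied to an existing config entry by entry. The sketch below assumes each annotator entry is a `dict` with an `annotator_type` key and an optional `args` dict. `migrate_annotator` is a hypothetical helper, not part of `deduce`, and the `my_package.annotators`, `MyAnnotator` and `threshold` values are made up for illustration.

```python
import copy

# Direct renames taken from the table above.
ANNOTATOR_TYPE_RENAMES = {
    "multi_token": "docdeid.process.MultiTokenLookupAnnotator",
    "token_pattern": "deduce.annotator.TokenPatternAnnotator",
    "annotation_context": "deduce.annotator.ContextAnnotator",
    "regexp": "docdeid.process.RegexpAnnotator",
}


def migrate_annotator(entry: dict) -> dict:
    """Return a copy of one annotator config entry with a 3.0.0-style annotator_type."""
    entry = copy.deepcopy(entry)
    old_type = entry["annotator_type"]

    if old_type == "custom":
        # `module` and `class` move out of `args` into annotator_type itself.
        args = entry.get("args", {})
        entry["annotator_type"] = f"{args.pop('module')}.{args.pop('class')}"
    elif old_type == "dd_token_pattern":
        # The replacement annotator needs a different pattern, so this cannot be automated.
        raise ValueError("dd_token_pattern needs a manually rewritten pattern")
    else:
        # Unknown values (e.g. already `module.Class`) are left untouched.
        entry["annotator_type"] = ANNOTATOR_TYPE_RENAMES.get(old_type, old_type)
    return entry


old_entry = {
    "annotator_type": "custom",
    "args": {"module": "my_package.annotators", "class": "MyAnnotator", "threshold": 3},
}
print(migrate_annotator(old_entry))
# {'annotator_type': 'my_package.annotators.MyAnnotator', 'args': {'threshold': 3}}
```

The `dd_token_pattern` case deliberately raises, because the new annotator expects a different pattern format that has to be rewritten by hand.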
+# Migrating to version `2.0.0`
+Version `2.0.0` of `deduce` sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply `deduce` to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.
+## Calling `deduce`
+`deduce` is now called through `Deduce.deidentify`, which replaces the `annotate_text` and `deidentify_annotations` functions. Those functions emit a `DeprecationWarning` as of version `2.0.0` and will be removed in version `2.1.0`.
+<table>
+<tr>
+<th align="center" width="50%">deprecated</th>
+<th align="center" width="50%">new</th>
+</tr>
+<tr>
+<td>
+```python
+from deduce import annotate_text, deidentify_annotations
+text = "Jan Jansen"
+annotated_text = annotate_text(text)
+deidentified_text = deidentify_annotations(annotated_text)
+```
+</td>
+<td>
+```python
+from deduce import Deduce
+text = "Jan Jansen"
+deduce = Deduce()
+doc = deduce.deidentify(text)
+```
+</td>
+</tr>
+</table>
+## Accessing output
+The annotations and deidentified text are now available in the `Document` object. In-text annotations can still be useful for comparisons; they can be obtained by passing the document to a utility function from the `docdeid` library (note that the format has changed).
+<table>
+<tr>
+<th align="center" width="50%">deprecated</th>
+<th align="center" width="50%">new</th>
+</tr>
+<tr>
+<td>
+```python
+print(annotated_text)
+# '<PERSOON Jan Jansen>'
+print(deidentified_text)
+# '<PERSOON-1>'
+```
+</td>
+<td>
+```python
+import docdeid as dd
+print(dd.utils.annotate_intext(doc))
+# '<PERSOON>Jan Jansen</PERSOON>'
+print(doc.annotations)
+# AnnotationSet({
+#     Annotation(
+#         text="Jan Jansen",
+#         start_char=0,
+#         end_char=10,
+#         tag="persoon",
+#         length="10"
+#     )
+# })
+print(doc.deidentified_text)
+# '<PERSOON-1>'
+```
+</td>
+</tr>
+</table>
+## Adding patient names
+The `patient_first_names`, `patient_initials`, `patient_surname` and `patient_given_name` keywords of `annotate_text` are replaced by the `Person` class, which offers a structured way to enter this information. A `Person` instance can be passed to `deidentify()` as metadata. The use of a given name is deprecated; it can instead be added as a separate first name. The behaviour remains the same.
+<table>
+<tr>
+<th align="center" width="50%">deprecated</th>
+<th align="center" width="50%">new</th>
+</tr>
+<tr>
+<td>
+```python
+from deduce import annotate_text, deidentify_annotations
+text = "Jan Jansen"
+annotated_text = annotate_text(
+    text,
+    patient_first_names="Jan Hendrik",
+    patient_initials="JH",
+    patient_surname="Jansen",
+    patient_given_name="Joop"
+)
+deidentified_text = deidentify_annotations(annotated_text)
+```
+</td>
+<td>
+```python
+from deduce import Deduce
+from deduce.person import Person
+text = "Jan Jansen"
+patient = Person(
+    first_names=['Jan', 'Hendrik', 'Joop'],
+    initials="JH",
+    surname="Jansen"
+)
+deduce = Deduce()
+doc = deduce.deidentify(text, metadata={'patient': patient})
+```
+</td>
+</tr>
+</table>
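Where many call sites still use the old keywords, the conversion shown in the table can be captured in a small function. This is a sketch under assumptions: `person_kwargs_from_legacy` is a hypothetical helper, not part of `deduce`, and it assumes that the old `patient_first_names` value is a single whitespace-separated string, as in the example above.

```python
def person_kwargs_from_legacy(
    patient_first_names: str = "",
    patient_initials: str = "",
    patient_surname: str = "",
    patient_given_name: str = "",
) -> dict:
    """Build keyword arguments for `Person` from the old `annotate_text` keywords."""
    # Assumption: old-style first names were one whitespace-separated string.
    first_names = patient_first_names.split()
    # A given name is now just another first name.
    if patient_given_name:
        first_names.append(patient_given_name)

    kwargs = {}
    if first_names:
        kwargs["first_names"] = first_names
    if patient_initials:
        kwargs["initials"] = patient_initials
    if patient_surname:
        kwargs["surname"] = patient_surname
    return kwargs


kwargs = person_kwargs_from_legacy(
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop",
)
print(kwargs)
# {'first_names': ['Jan', 'Hendrik', 'Joop'], 'initials': 'JH', 'surname': 'Jansen'}
# patient = Person(**kwargs)
```

Returning plain keyword arguments keeps the helper free of any `deduce` import, so it can be tested on its own before being wired into `Person(**kwargs)`.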
+## Enabling/disabling specific categories
+Previously, the `annotate_text` function allowed disabling specific categories with the `dates`, `ages`, `names`, etc. keywords. The same behaviour can now be achieved by setting the `disabled` argument of the `Deduce.deidentify` method. Note that the identification logic of Deduce is now further split up into `Annotator` classes, which allows enabling or disabling specific components. You can read more about the specific annotators and other components in the tutorial [here](tutorial.md#annotators), and find more information on enabling, disabling, replacing or modifying specific components [here](tutorial.md#customizing-deduce).
+<table>
+<tr>
+<th align="center" width="50%">deprecated</th>
+<th align="center" width="50%">new</th>
+</tr>
+<tr>
+<td>
+```python
+from deduce import annotate_text, deidentify_annotations
+text = "Jan Jansen"
+annotated_text = annotate_text(
+    text,
+    dates=False,
+    ages=False
+)
+deidentified_text = deidentify_annotations(annotated_text)
+```
+</td>
+<td>
+```python
+from deduce import Deduce
+text = "Jan Jansen"
+deduce = Deduce()
+doc = deduce.deidentify(
+    text,
+    disabled={'dates', 'ages'}
+)
+```
+</td>
+</tr>
+</table>
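The old boolean keywords can be turned into a `disabled` set with a one-line comprehension. The sketch below uses a hypothetical helper, `disabled_from_legacy_flags`, that is not part of `deduce`. It assumes that each old keyword name matches the new category name one-to-one, which the example above shows only for `dates` and `ages`.

```python
def disabled_from_legacy_flags(**flags: bool) -> set:
    """Collect the categories whose old-style keyword was set to False."""
    # Assumption: old keyword names equal the new category names.
    return {category for category, enabled in flags.items() if not enabled}


disabled = disabled_from_legacy_flags(dates=False, ages=False, names=True)
print(sorted(disabled))  # ['ages', 'dates']
# doc = deduce.deidentify(text, disabled=disabled)
```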