# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 3.0.3 (2024-07-16)

### Added

- a `cache_path` option, to define the path for saving/loading the lookup structure cache. Use this if your install directory is not writable.

### Removed

- the `config_file` keyword, now replaced by `config`, which accepts both filenames and dicts
- old lookup list names, e.g. `prefixes`, now replaced by `prefix`
- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting the class directly as `annotator_type`
- everything in `deduce.pattern`; patient patterns are now replaced by `PatientNameAnnotator`
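
To illustrate what setting a class directly as `annotator_type` can involve, the sketch below resolves a dotted reference (such as `deduce.annotator.TokenPatternAnnotator`) to a class object using only the standard library. The helper name `resolve_class` is hypothetical and not part of deduce, and deduce's own loading code may differ. The demonstration uses a standard-library class so that it runs without deduce installed.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to a class object (hypothetical helper)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not refer to a class")
    return obj


# Demonstrated with a standard-library class, so deduce is not required here.
ordered_dict_cls = resolve_class("collections.OrderedDict")
```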

## 3.0.2 (2024-02-15)

### Changed
- recognize 4+ spaces as a token, blocking annotations

## 3.0.1 (2023-12-20)

### Fixed
- a bug with packaging `base_config.json`

## 3.0.0 (2023-12-20)

### Added
- speed optimizations, ~250%
- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
- `PatientNameAnnotator`, which replaces `deduce.pattern`
- a structured way of loading and building lookup structures (lists and tries), including caching
- `pre_match_words` for some regexp annotators, speeding up annotation
- option to pass a user config as a dict (using the `config` keyword)

### Changed
- speedup for `TokenPatternAnnotator`
- some internals of `ContextPatternAnnotator`
- initials now detected by lookup list, rather than pattern
- redactor open and close chars from `<` `>` to `[` `]`, as the previous chars caused issues in HTML (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.)
- names of lookup structures to singular (`prefix`, rather than `prefixes`)
- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING`
- refactored and simplified annotator loading; specifically, the `annotator_type` config keyword now accepts references to classes (e.g. `deduce.annotator.TokenPatternAnnotator`)
- renamed `interfix_with_capital` annotator to `interfix_with_name`
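
The effect of the new redactor characters can be sketched as follows. This is a minimal illustration and not deduce's actual redactor: the `redact` function and the `(start, end, tag)` tuple format are assumptions made for the example, and the spans are assumed not to overlap.

```python
def redact(text, annotations, open_char="[", close_char="]"):
    """Replace each (start, end, tag) span with open_char + tag + close_char.

    Hypothetical helper; assumes the spans do not overlap.
    """
    result = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, tag in sorted(annotations, reverse=True):
        result = result[:start] + open_char + tag + close_char + result[end:]
    return result


text = "Jan Jansen woont in Utrecht"
annotations = [(0, 10, "PATIENT"), (20, 27, "LOCATIE")]
redacted = redact(text, annotations)
# redacted == "[PATIENT] woont in [LOCATIE]"
```

With the old `<` and `>` characters, the same output would contain `<PATIENT>`, which a browser may interpret as an HTML tag.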

### Deprecated
- the `config_file` keyword, now replaced by `config`, which accepts both filenames and dicts
- old lookup list names, e.g. `prefixes`, now replaced by `prefix`
- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting the class directly as `annotator_type`
- everything in `deduce.pattern`; patient patterns are now replaced by `PatientNameAnnotator`

### Removed
- automated coverage reporting on coveralls.io
- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns
- `utils.any_in_text`

### Fixed
- some small additions/removals for specific lookup lists
- smaller bugs related to overlapping matches

## 2.5.0 (2023-11-28)

### Added
- the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words
- a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. van Loon`

### Changed
- the age detection component, with improved logic and pseudo patterns
- annotations are no longer counted as adjacent when separated by a comma
- streets are prioritized over names when merging overlapping annotations
- removed some false positives for postal codes ending in `gr` or `ie`
- extended the postbus pattern for `xx.xxx` format (old notation)
- some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists

### Fixed
- a bug with `BsnAnnotator` with non-digit characters in regexp

## 2.4.4 (2023-11-22)

### Changed
- multi-token lookup for first and last names, so multi-token names are now detected
- some small lookup list additions

## 2.4.3 (2023-11-22)

### Changed
- extended list of medical terms

## 2.4.2 (2023-11-21)

### Changed
- name lookup list contents, extending names and adding more exceptions

## 2.4.1 (2023-11-15)

### Added
- detection of initials `Ch.`, `Chr.`, `Ph.` and `Th.`

## 2.4.0 (2023-11-15)

### Added
- logic for detecting hospitals, with added whitelist and separate annotator

### Changed
- logic for detecting (non-hospital) institutions, with extended lookup list

### Removed
- the separate Altrecht annotator, now included in the lookup list

## 2.3.1 (2023-11-01)

### Fixed
- include data files recursively in package

## 2.3.0 (2023-10-25)

### Added
- lookup lists (and logic) for Dutch provinces, regions, municipalities and streets

### Changed
- name of `residences` annotator to `placenames`, now includes provinces, regions and municipalities
- lookup lists (and logic) for residences
- logic for streets, house numbers and house number letters

## 2.2.0 (2023-09-28)

### Changed
- tokenizer logic:
    - a token is now a sequence of alphanumeric characters, a single newline, or a single special character
    - whitespaces are no longer considered tokens
- moved token pattern logic to config, using a new `TokenPatternAnnotator`
- moved context pattern logic to config, using a new `ContextAnnotator`
- many updates to name detection logic
- lookup list optimizations
- added, removed and simplified patterns
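
The tokenizer rule listed above (a token is a run of alphanumeric characters, a single newline, or a single special character, and other whitespace is not a token) can be sketched with a regular expression. This is an illustration only: the pattern and the `tokenize` function are assumptions for the example, and deduce's actual tokenizer may differ.

```python
import re

# One token is: a run of alphanumeric characters, a single newline,
# or a single character that is neither alphanumeric nor whitespace.
# Other whitespace (spaces, tabs) matches nothing and is skipped.
TOKEN_PATTERN = re.compile(r"[^\W_]+|\n|[^\w\s]|_")


def tokenize(text):
    """Split text into tokens according to the rule above (hypothetical helper)."""
    return [match.group(0) for match in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Jan-Willem, 12 jaar\n\nDr. X")
# tokens == ["Jan", "-", "Willem", ",", "12", "jaar", "\n", "\n", "Dr", ".", "X"]
```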

## 2.1.0 (2023-08-07)

### Added
- a component for deidentifying BSN numbers
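
For context, a Dutch BSN is a 9-digit number that satisfies the so-called elfproef (11-check). The sketch below shows that check in isolation; the function name `passes_elfproef` is hypothetical, and this changelog does not state how the deduce component itself validates candidates.

```python
def passes_elfproef(bsn: str) -> bool:
    """Return True if a 9-digit string satisfies the BSN 11-check (hypothetical helper)."""
    if len(bsn) != 9 or not (bsn.isascii() and bsn.isdigit()):
        return False
    # Weights 9 down to 2 for the first eight digits, and -1 for the last digit.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(digit) * weight for digit, weight in zip(bsn, weights))
    return total % 11 == 0


valid = passes_elfproef("111222333")    # weighted sum is 66, divisible by 11
invalid = passes_elfproef("123456789")  # weighted sum is 147, not divisible by 11
```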

### Changed
- updated dependencies
- by default, deduce now recognizes and tags BSN numbers
- by default, deduce now recognizes all other 7+ digit numbers as identifiers
- improved regular expressions for e-mail address and URL matching, with separate tags
- logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)
- improved regular expression for age matching
- date detection logic:
    - now only recognizes combinations of day, month and year (day/month combinations caused many false positives)
    - detects year-month-day format in addition to day-month-year
- loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
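
The custom config behaviour in the last item can be pictured as overlaying the user's options on the defaults. The sketch below is an assumption-laden illustration: `merge_config` and the option names are hypothetical, and whether deduce merges nested options recursively (as done here) is not stated in this changelog.

```python
import copy


def merge_config(defaults, custom):
    """Overlay explicitly set options on the defaults (hypothetical helper)."""
    merged = copy.deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged recursively; this is an assumption.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative option names only, not deduce's real config keys.
defaults = {"option_a": 1, "option_b": 2, "section": {"x": True, "y": False}}
custom = {"option_b": 20, "section": {"y": True}}
config = merge_config(defaults, custom)
# config == {"option_a": 1, "option_b": 20, "section": {"x": True, "y": True}}
```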

### Deprecated
- backwards compatibility, which was temporarily added to ease the transition from v1 to v2

### Removed
- a separate patient identifier tag, now superseded by a generic tag
- detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)

### Fixed
- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)

## 2.0.3 (2023-04-06)

### Fixed
- removed 'decibutus' from list of institutions as it caused many false positives

## 2.0.2 (2023-03-28)

### Changed
- upgraded dependencies, including `markdown-it-py` which had a vulnerability

## 2.0.1 (2022-12-09)

### Changed
- upgraded dependencies

## 2.0.0 (2022-12-05)

### Added
- introduced new interface for deidentification, using `Deduce()` class
- a separate documentation page, with tutorial and migration guide
- support for Python 3.10 and 3.11

### Changed
- major refactor that touches pretty much every line of code
- use `docdeid` package for logic
- speedups: now 973% faster
- use lookup sets instead of lookup lists
- refactor tokenizer
- refactor annotators into separate classes, using structured annotations
- guidelines for contributing

### Removed
- the `annotate_text` and `deidentify_annotations` functions
- all in-text annotation (under the hood) and associated functions
- support for given names; these can be added as another first name in the `Person` class
- support for Python 3.7 and 3.8

### Fixed
- `<` and `>` are no longer replaced by `(` and `)` respectively
- deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore

## 1.0.8 (2021-11-29)

### Added
- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation
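
The consistency check described above can be sketched as follows. The `Annotation` dataclass and the `find_mismatches` function are hypothetical stand-ins for illustration and do not reproduce deduce's own classes.

```python
import warnings
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical structured annotation: text plus its span in the original."""
    text: str
    start: int
    end: int
    tag: str


def find_mismatches(original, annotations):
    """Warn about annotations whose text differs from the span they denote."""
    mismatched = []
    for ann in annotations:
        span_text = original[ann.start:ann.end]
        if span_text != ann.text:
            warnings.warn(
                f"Annotation {ann.text!r} ({ann.tag}) does not match "
                f"original text {span_text!r} at {ann.start}:{ann.end}"
            )
            mismatched.append(ann)
    return mismatched


original = "Jan woont in Utrecht"
annotations = [
    Annotation("Jan", 0, 3, "PERSOON"),
    Annotation("utrecht", 13, 20, "LOCATIE"),  # lowercased, so it mismatches
]
mismatched = find_mismatches(original, annotations)
```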

### Fixed
- various modifications related to adding or subtracting spaces in annotated texts
- remove the lowercasing of institutions' names
- therefore, all structured annotations have texts matching the original text in the same span

## 1.0.7 (2021-11-03)

### Changed
- Internal code formatting improvements

### Added
- Contributing guidelines

## 1.0.6 (2021-10-06)

### Fixed
- Bug with multiple 4-digit mg dosages in one text

## 1.0.5 (2021-10-05)

### Fixed
- Minor bug where tag flattening had no effect

## 1.0.4 (2021-10-05)

### Added
- Changelog
- Additional unit tests for whitespace/punctuation

### Fixed
- Various whitespace/punctuation bugs
- Bug with nested tags not related to person names
- Bug with adjacent tags not being merged

## 1.0.3 (2021-07-07)

### Added
- Structured annotations
- Some unit tests

### Fixed
- Error with outdated unicode package
- Bug with context

## 1.0.2
Release to PyPI

## 1.0.1
Small bugfix for None as input

## 1.0.0
Initial version