Diff of /CHANGELOG.md [000000] .. [79668b]

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 3.0.3 (2024-07-16)

### Added

- a `cache_path` option to define the path for saving/loading the lookup structure cache; use this if your install directory is not writable

### Removed

- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
- old lookup list names, e.g. `prefixes` now replaced by `prefix`
- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting the class directly as `annotator_type`
- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`

## 3.0.2 (2024-02-15)

### Changed
- recognize 4+ spaces as a token, blocking annotations

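A minimal sketch of the idea behind the 3.0.2 change, not deduce's actual tokenizer. The regular expression and the `tokenize_with_gaps` helper are assumptions made for illustration: a run of four or more spaces becomes a token of its own, so two words separated by a wide gap are no longer direct neighbours in the token sequence.

```python
import re

# Hypothetical pattern: alphanumeric runs, runs of 4+ spaces, single newlines,
# or any other single non-whitespace character. Shorter whitespace is skipped.
GAP_TOKEN_RE = re.compile(r"[^\W_]+| {4,}|\n|\S")


def tokenize_with_gaps(text: str) -> list:
    """Split text into tokens, keeping runs of 4+ spaces as separate tokens."""
    return GAP_TOKEN_RE.findall(text)


# With a single space the two words are neighbouring tokens.
print(tokenize_with_gaps("Jan Jansen"))
# With five spaces a gap token sits between them.
print(tokenize_with_gaps("Jan     Jansen"))
```

A pattern that requires two consecutive word tokens would match the first input but not the second, which is one way a wide gap can block an annotation.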
## 3.0.1 (2023-12-20)

### Fixed
- a bug with packaging `base_config.json`

## 3.0.0 (2023-12-20)

### Added
- speed optimizations, ~250%
- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
- `PatientNameAnnotator`, which replaces `deduce.pattern`
- a structured way for loading and building lookup structures (lists and tries), including caching
- `pre_match_words` for some regexp annotators, speeding up the annotating
- option to present a user config as dict (using `config` keyword)

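The `config` keyword is described as accepting both filenames and dicts. A rough sketch of how such a dual-purpose argument can be handled is shown below; `load_config` and the option name `example_option` are hypothetical and are not taken from deduce's source.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Union


def load_config(config: Union[str, Path, dict]) -> dict:
    """Return a config dict from either a dict or a path to a JSON file."""
    if isinstance(config, dict):
        # Shallow copy, so the caller's dict is not mutated later on.
        return dict(config)
    with open(config, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical option name, used only for illustration.
as_dict = load_config({"example_option": True})

# The same settings, read from a JSON file on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"example_option": True}, f)
    from_file = load_config(path)

print(as_dict == from_file)
```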
### Changed
- speedup for `TokenPatternAnnotator`
- some internals of `ContextPatternAnnotator`
- initials now detected by lookup list, rather than pattern
- redactor open and close chars from `<` `>` to `[` `]`, as previous chars caused issues in html (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.)
- names of lookup structures to singular (`prefix`, rather than `prefixes`)
- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING`
- refactored and simplified annotator loading, specifically the `annotator_type` config keyword now accepts references to classes (e.g. `deduce.annotator.TokenPatternAnnotator`)
- renamed `interfix_with_capital` annotator to `interfix_with_name`

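Accepting a class reference such as `deduce.annotator.TokenPatternAnnotator` in a config usually means resolving a dotted path to a class at load time. The sketch below shows one common way to do that with `importlib`; `resolve_class` is a hypothetical helper, and whether deduce resolves `annotator_type` exactly like this is an assumption. A standard-library class stands in for an annotator so the example runs without deduce installed.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for a value like "deduce.annotator.TokenPatternAnnotator".
cls = resolve_class("collections.OrderedDict")
print(cls.__name__)
```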
### Deprecated
- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
- old lookup list names, e.g. `prefixes` now replaced by `prefix`
- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting class directly as `annotator_type`
- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`

### Removed
- automated coverage reporting on coveralls.io
- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns
- `utils.any_in_text`

### Fixed
- some small additions/removals for specific lookup lists
- smaller bugs related to overlapping matches

## 2.5.0 (2023-11-28)

### Added
- the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words
- a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. van Loon`

### Changed
- the age detection component, with improved logic and pseudo patterns
- annotations are no longer counted as adjacent when separated by a comma
- streets are prioritized over names when merging overlapping annotations
- removed some false positives for postal codes ending in `gr` or `ie`
- extended the postbus pattern for `xx.xxx` format (old notation)
- some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists

### Fixed
- a bug with `BsnAnnotator` with non-digit characters in regexp

## 2.4.4 (2023-11-22)

### Changed
- multi-token lookup for first and last names, so multi-token names are now detected
- some small lookup list additions

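Multi-token lookup is often implemented with a trie keyed on tokens, so that the longest listed name starting at a position can be found in one pass. The sketch below illustrates that general technique; `build_trie` and `longest_match` are hypothetical names, and the example names are made up rather than taken from deduce's lookup lists.

```python
def build_trie(entries: list) -> dict:
    """Build a nested-dict trie from a list of token sequences."""
    root: dict = {}
    for tokens in entries:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[None] = True  # end-of-entry marker
    return root


def longest_match(trie: dict, tokens: list, start: int) -> int:
    """Return the token count of the longest entry starting at `start` (0 if none)."""
    node, best = trie, 0
    for offset, token in enumerate(tokens[start:], start=1):
        if token not in node:
            break
        node = node[token]
        if None in node:
            best = offset
    return best


trie = build_trie([["Jan"], ["Jan", "Willem"], ["Anne", "Marie"]])
tokens = ["Dhr", "Jan", "Willem", "Bakker"]
print(longest_match(trie, tokens, 1))
```

Starting at index 1 the longest listed entry is `Jan Willem`, so the function returns 2; a single-token lookup would have stopped at `Jan`.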
## 2.4.3 (2023-11-22)

### Changed
- extended list of medical terms

## 2.4.2 (2023-11-21)

### Changed
- name lookup list contents, extending names and adding more exceptions

## 2.4.1 (2023-11-15)

### Added
- detection of initials `Ch.`, `Chr.`, `Ph.` and `Th.`

## 2.4.0 (2023-11-15)

### Added
- logic for detecting hospitals, with added whitelist and separate annotator

### Changed
- logic for detecting (non-hospital) institutions, with extended lookup list

### Removed
- the separate Altrecht annotator, now included in the lookup list

## 2.3.1 (2023-11-01)

### Fixed
- include data files recursively in package

## 2.3.0 (2023-10-25)

### Added
- lookup lists (and logic) for Dutch provinces, regions, municipalities and streets

### Changed
- name of `residences` annotator to `placenames`, now includes provinces, regions and municipalities
- lookup lists (and logic) for residences
- logic for streets, housenumber and housenumber letters

## 2.2.0 (2023-09-28)

### Changed
- tokenizer logic:
  - a token is now a sequence of alphanumeric characters, a single newline, or a single special character
  - whitespaces are no longer considered tokens
- moved token pattern logic to config, using a new `TokenPatternAnnotator`
- moved context pattern logic to config, using a new `ContextAnnotator`
- many updates to name detection logic:
  - lookup list optimizations
  - added, removed and simplified patterns

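The tokenizer rule from 2.2.0 can be approximated with a single regular expression. This is a sketch under assumptions, not the tokenizer shipped with deduce: the pattern and the `tokenize` helper are illustrative, and "special character" is taken here to mean any non-alphanumeric, non-whitespace character.

```python
import re

# Alphanumeric run, OR a single newline, OR any other single
# non-whitespace character. Spaces and tabs match nothing and are skipped.
TOKEN_RE = re.compile(r"[^\W_]+|\n|\S")


def tokenize(text: str) -> list:
    """Split text into tokens following the rule described above."""
    return TOKEN_RE.findall(text)


print(tokenize("Dr. Jansen\n06-123"))
```

The example yields the tokens `Dr`, `.`, `Jansen`, a newline, `06`, `-` and `123`; the space after the period does not appear as a token.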
## 2.1.0 (2023-08-07)

### Added
- a component for deidentifying BSN numbers

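A well-known validity check for a Dutch BSN is the "11-test" (elfproef): the nine digits are weighted 9, 8, 7, 6, 5, 4, 3, 2 and -1, and the weighted sum must be divisible by 11. The sketch below shows that check in isolation; whether deduce's BSN component applies exactly this test is an assumption, and `passes_eleven_test` is a hypothetical name.

```python
def passes_eleven_test(bsn: str) -> bool:
    """Return True if a 9-digit string satisfies the BSN 11-test."""
    if len(bsn) != 9 or not (bsn.isascii() and bsn.isdigit()):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(digit) * weight for digit, weight in zip(bsn, weights))
    return total % 11 == 0


print(passes_eleven_test("111222333"))  # weighted sum is 66
print(passes_eleven_test("123456789"))  # weighted sum is 147
```

Combining a digit pattern with a checksum like this is a common way to keep arbitrary 9-digit numbers from being tagged as a BSN.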
### Changed
- updated dependencies
- by default, deduce now recognizes and tags BSN numbers
- by default, deduce now recognizes all other 7+ digit numbers as identifiers
- improved regular expressions for e-mail address and URL matching, with separate tags
- logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)
- improved regular expression for age matching
- date detection logic:
  - now only recognizes combinations of day, month and year (day/month combinations caused many false positives)
  - detects year-month-day format in addition to day-month-year
- loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config

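The custom config behaviour described in the last item amounts to overlaying user settings on top of defaults. A minimal sketch follows, assuming a recursive merge of nested dicts; the changelog only states that unspecified options fall back to defaults, so the nesting behaviour, the `merge_config` helper and all option names here are hypothetical.

```python
import copy


def merge_config(defaults: dict, custom: dict) -> dict:
    """Return defaults overlaid with the options explicitly set in custom."""
    merged = copy.deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical option names, for illustration only.
defaults = {
    "redact": True,
    "annotators": {"date": {"enabled": True}, "phone": {"enabled": True}},
}
custom = {"annotators": {"date": {"enabled": False}}}

merged = merge_config(defaults, custom)
print(merged)
```

Only the `date` setting changes; `redact` and the `phone` section keep their default values, and `defaults` itself is left unmodified.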
### Deprecated
- backwards compatibility, which was temporarily added to transition from v1 to v2

### Removed
- a separate patient identifier tag, now superseded by a generic tag
- detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)

### Fixed
- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)

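The adjacency fix can be pictured as a check on the text between two annotation spans. The sketch below is an illustration under assumptions, not deduce's merging logic: `is_adjacent`, the `max_gap` parameter and the character-offset representation of spans are all hypothetical; only the newline and tab separators come from the entry above.

```python
# Separators that prevent two annotations from being treated as adjacent.
BLOCKING_SEPARATORS = {"\n", "\t"}


def is_adjacent(text: str, end_first: int, start_second: int, max_gap: int = 1) -> bool:
    """Return True if the gap between two spans is short and not blocking."""
    gap = text[end_first:start_second]
    if len(gap) > max_gap:
        return False
    return not any(char in BLOCKING_SEPARATORS for char in gap)


# Spans are (start, end) character offsets of two annotations.
print(is_adjacent("Jan Jansen", 3, 4))   # separated by a space
print(is_adjacent("Jan\nJansen", 3, 4))  # separated by a newline
```

Under this rule the first pair could be merged into one annotation, while the pair split by a newline stays separate.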
## 2.0.3 (2023-04-06)

### Fixed
- removed 'decibutus' from list of institutions as it caused many false positives

## 2.0.2 (2023-03-28)

### Changed
- upgraded dependencies, including `markdown-it-py` which had a vulnerability

## 2.0.1 (2022-12-09)

### Changed
- upgraded dependencies

## 2.0.0 (2022-12-05)

### Added
- introduced new interface for deidentification, using `Deduce()` class
- a separate documentation page, with tutorial and migration guide
- support for Python 3.10 and 3.11

### Changed
- major refactor that touches pretty much every line of code
- use `docdeid` package for logic
- speedups: now 973% faster
- use lookup sets instead of lookup lists
- refactor tokenizer
- refactor annotators into separate classes, using structured annotations
- guidelines for contributing

### Removed
- the `annotate_text` and `deidentify_annotations` functions
- all in-text annotation (under the hood) and associated functions
- support for given names; given names can be added as another first name in the `Person` class
- support for Python 3.7 and 3.8

### Fixed
- `<` and `>` are no longer replaced by `(` and `)` respectively
- deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore

## 1.0.8 (2021-11-29)

### Added
- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation

### Fixed
- various modifications related to adding or subtracting spaces in annotated texts
- remove the lowercasing of institutions' names
- therefore, all structured annotations have texts matching the original text in the same span

## 1.0.7 (2021-11-03)

### Changed
- Internal code formatting improvements

### Added
- Contributing guidelines

## 1.0.6 (2021-10-06)

### Fixed
- Bug with multiple 4-digit mg dosages in one text

## 1.0.5 (2021-10-05)

### Fixed
- Minor bug where tag flattening had no effect

## 1.0.4 (2021-10-05)

### Added
- Changelog
- Additional unit tests for whitespace/punctuation

### Fixed
- Various whitespace/punctuation bugs
- Bug with nested tags not related to person names
- Bug with adjacent tags not being merged

## 1.0.3 (2021-07-07)

### Added
- Structured annotations
- Some unit tests

### Fixed
- Error with outdated unicode package
- Bug with context

## 1.0.2
Release to PyPI

## 1.0.1
Small bugfix for None as input

## 1.0.0
Initial version