Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
3.0.3 (2024-07-16)
Added
- A cache_path option, to define the path for saving/loading the lookup structure cache. You should use this if your install directory is not writable.
Removed
- the
config_file
keyword, now replaced by config
which accepts both filenames and dicts
- old lookup list names, e.g.
prefixes
now replaced by prefix
- annotator types
custom
, regexp
, token_pattern
, dd_token_pattern
and annotation_context
, all replaced by setting class directly as annotator_type
- everything in
deduce.pattern
, patient patterns now replaced by PatientNameAnnotator
3.0.2 (2024-02-15)
Changed
- recognize 4+ spaces as a token, blocking annotations
3.0.1 (2023-12-20)
Fixed
- a bug with packaging
base_config.json
3.0.0 (2023-12-20)
Added
- speed optimizations, ~250%
- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
PatientNameAnnotator
, which replaces deduce.pattern
- a structured way for loading and building lookup structures (lists and tries), including caching
pre_match_words
for some regexp annotators, speeding up the annotating
- option to present a user config as dict (using
config
keyword)
Changed
- speedup for
TokenPatternAnnotator
- some internals of
ContextPatternAnnotator
- initials now detected by lookup list, rather than pattern
- redactor open and close chars from
<
>
to [
]
, as previous chars caused issues in html (so deidentified text now shows [PATIENT]
, [LOCATIE]
, etc.)
- names of lookup structures to singular (
prefix
, rather than prefixes
)
INSTELLING
tag to ZIEKENHUIS
and ZORGINSTELLING
- refactored and simplified annotator loading, specifically the
annotator_type
config keyword now accepts references to classes (e.g deduce.annotator.TokenPatternAnnotator
)
- renamed
interfix_with_capital
annotator to interfix_with_name
Deprecated
- the
config_file
keyword, now replaced by config
which accepts both filenames and dicts
- old lookup list names, e.g.
prefixes
now replaced by prefix
- annotator types
custom
, regexp
, token_pattern
, dd_token_pattern
and annotation_context
, all replaced by setting class directly as annotator_type
- everything in
deduce.pattern
, patient patterns now replaced by PatientNameAnnotator
Removed
- automated coverage reporting on coveralls.io
- options
lowercase_lookup
, lowercase_neg_lookup
for token patterns
utils.any_in_text
Fixed
- some small additions/removals for specific lookup lists
- smaller bugs related to overlapping matches
2.5.0 (2023-11-28)
Added
- the
RegexpPseudoAnnotator
component for filtering regexp matches based on preceding/following words
- a
prefix_with_interfix
pattern for names, detecting e.g. Dr. van Loon
Changed
- the age detection component, with improved logic and pseudo patterns
- annotations are no longer counted adjacent when separated by a comma
- streets are prioritized over names when merging overlapping annotations
- removed some false positives for postal codes ending in
gr
or ie
- extended the postbus pattern for
xx.xxx
format (old notation)
- some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists
Fixed
- a bug with
BsnAnnotator
with non-digit characters in regexp
2.4.2 (2023-11-22)
Changed
- multi-token lookup for first- and last names, so multi token names are now detected
- some small lookup list additions
2.4.3 (2023-11-22)
Changed
- extended list of medical terms
2.4.2 (2023-11-21)
Changed
- name lookup list contents, extending names and adding more exceptions
2.4.1 (2023-11-15)
Added
- detection of initials
Ch.
, Chr.
, Ph.
and Th.
2.4.0 (2023-11-15)
Added
- logic for detecting hospitals, with added whitelist and separate annotator
Changed
- logic for detecting (non-hospital) institutions, with extended lookup list
Removed
- the separate Altrecht annotator, now included in the lookup list
2.3.1 (2023-11-01)
Fixed
- include data files recursively in package
2.3.0 (2023-10-25)
Added
- lookup lists (and logic) for Dutch provinces, regions, municipalities and streets
Changed
- name of
residences
annotator to placenames
, now includes provinces, regions and municipalities
- lookup lists (and logic) for residences
- logic for streets, housenumber and housenumber letters
2.2.0 (2023-09-28)
Changed
- tokenizer logic:
- a token is now a sequence of alphanumeric characters, a single newline, or a single special character.
- whitespaces are no longer considered tokens
- moved token pattern logic to config, using a new
TokenPatternAnnotator
- moved context pattern logic to config, using a new
ContextAnnotator
- many updates to name detection logic
- lookup list optimizations
- added, removed and simplified patterns
2.1.0 (2023-08-07)
Added
- a component for deidentifying BSN-nummers
Changed
- updated dependencies
- by default, deduce now recognizes and tags bsn nummers
- by default, deduce now recognizes all other 7+ digit numbers as identifiers
- improved regular expressions for e-mail address and url matching, with separate tags
- logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)
- improved regular expression for age matching
- date detection logic:
- now only recognizes combinations of day, month and year (day/month combinations caused many false positives)
- detects year-month-day format in addition to (day-month-year)
- loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
Deprecated
- backwards compatibility, which was temporary added to transition from v1 to v2
Removed
- a separate patient identifier tag, now superseded by a generic tag
- detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)
Fixed
- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
2.0.3 (2023-04-06)
Fixed
- removed 'decibutus' from list of institutions as it caused many false positives
2.0.2 (2023-03-28)
Changed
- upgraded dependencies, including
markdown-it-py
which had a vulnerability
2.0.1 (2022-12-09)
Changed
2.0.0 (2022-12-05)
Added
- introduced new interface for deidentification, using
Deduce()
class
- a separate documentation page, with tutorial and migration guide
- support for python 3.10 and 3.11
Changed
- major refactor that touches pretty much every line of code
- use
docdeid
package for logic
- speedups: now 973% faster
- use lookup sets instead of lookup lists
- refactor tokenizer
- refactor annotators into separate classes, using structured annotations
- guidelines for contributing
Removed
- the
annotate_text
and deidentify_annotations
functions
- all in-text annotation (under the hood) and associated functions
- support for given names. given names can be added as another first name in the
Person
class.
- support for python 3.7 and 3.8
Fixed
<
and >
are no longer replaced by (
and )
respectively
- deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore
1.0.8 (2021-11-29)
Added
- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation
Fixed
- various modifications related to adding or subtracting spaces in annotated texts
- remove the lowercasing of institutions' names
- therefore, all structured annotations have texts matching the original text in the same span
1.0.7 (2021-11-03)
Changed
- Internal code formatting improvements
Added
1.0.6 (2021-10-06)
Fixed
- Bug with multiple 4-digit mg dosages in one text
1.0.5 (2021-10-05)
Fixed
- Minor bug where tag flattening had no effect
1.0.4 (2021-10-05)
Added
- Changelog
- Additional unit tests for whitespace/punctuation
Fixed
- Various whitespace/punctuation bugs
- Bug with nested tags not related to person names
- Bug with adjacent tags not being merged
1.0.3 (2021-07-07)
Added
- Structured annotations
- Some unit tests
Fixed
- Error with outdated unicode package
- Bug with context
1.0.2
Release to PyPI
1.0.1
Small bugfix for None as input
1.0.0
Initial version