Diff of /CHANGELOG.md [000000] .. [79668b]

Switch to side-by-side view

--- a
+++ b/CHANGELOG.md
@@ -0,0 +1,266 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## 3.0.3 (2024-07-16)
+
+### Added
+
+- A cache_path option, to define the path for saving/loading the lookup structure cache. You should use this if your install directory is not writable.
+
+### Removed
+
+- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
+- old lookup list names, e.g. `prefixes` now replaced by `prefix`
+- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting class directly as `annotator_type` 
+- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`
+
+## 3.0.2 (2024-02-15)
+
+### Changed
+- recognize 4+ spaces as a token, blocking annotations
+
+## 3.0.1 (2023-12-20)
+
+### Fixed
+- a bug with packaging `base_config.json`
+
+## 3.0.0 (2023-12-20)
+
+### Added
+- speed optimizations, ~250%
+- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
+- `PatientNameAnnotator`, which replaces `deduce.pattern`
+- a structured way for loading and building lookup structures (lists and tries), including caching
+- `pre_match_words` for some regexp annotators, speeding up the annotating
+- option to present a user config as dict (using `config` keyword)
+
+### Changed
+- speedup for `TokenPatternAnnotator`
+- some internals of `ContextPatternAnnotator`
+- initials now detected by lookup list, rather than pattern
+- redactor open and close chars from `<` `>` to `[` `]`, as previous chars caused issues in html (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.)
+- names of lookup structures to singular (`prefix`, rather than `prefixes`)
+- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING`
+- refactored and simplified annotator loading, specifically the `annotator_type` config keyword now accepts references to classes (e.g `deduce.annotator.TokenPatternAnnotator`)
+- renamed `interfix_with_capital` annotator to `interfix_with_name` 
+
+### Deprecated
+- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
+- old lookup list names, e.g. `prefixes` now replaced by `prefix`
+- annotator types `custom`, `regexp`, `token_pattern`, `dd_token_pattern` and `annotation_context`, all replaced by setting class directly as `annotator_type` 
+- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`
+
+### Removed
+- automated coverage reporting on coveralls.io
+- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns
+- `utils.any_in_text` 
+
+### Fixed
+- some small additions/removals for specific lookup lists
+- smaller bugs related to overlapping matches
+
+## 2.5.0 (2023-11-28)
+
+### Added
+- the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words
+- a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. van Loon`
+
+### Changed
+- the age detection component, with improved logic and pseudo patterns
+- annotations are no longer counted adjacent when separated by a comma
+- streets are prioritized over names when merging overlapping annotations
+- removed some false positives for postal codes ending in `gr` or `ie`
+- extended the postbus pattern for `xx.xxx` format (old notation)
+- some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists
+
+### Fixed
+- a bug with `BsnAnnotator` with non-digit characters in regexp
+
+## 2.4.2 (2023-11-22)
+
+### Changed
+- multi-token lookup for first- and last names, so multi token names are now detected
+- some small lookup list additions
+
+## 2.4.3 (2023-11-22)
+
+### Changed
+- extended list of medical terms
+
+## 2.4.2 (2023-11-21)
+
+### Changed
+- name lookup list contents, extending names and adding more exceptions
+
+## 2.4.1 (2023-11-15)
+
+### Added
+- detection of initials `Ch.`, `Chr.`, `Ph.` and `Th.` 
+
+## 2.4.0 (2023-11-15)
+
+### Added
+- logic for detecting hospitals, with added whitelist and separate annotator
+
+### Changed
+- logic for detecting (non-hospital) institutions, with extended lookup list
+
+### Removed
+- the separate Altrecht annotator, now included in the lookup list
+
+## 2.3.1 (2023-11-01)
+
+### Fixed
+- include data files recursively in package
+
+## 2.3.0 (2023-10-25)
+
+### Added
+- lookup lists (and logic) for Dutch provinces, regions, municipalities and streets
+
+### Changed
+- name of `residences` annotator to `placenames`, now includes provinces, regions and municipalities
+- lookup lists (and logic) for residences
+- logic for streets, housenumber and housenumber letters
+
+## 2.2.0 (2023-09-28)
+
+### Changed
+- tokenizer logic: 
+  - a token is now a sequence of alphanumeric characters, a single newline, or a single special character. 
+  - whitespaces are no longer considered tokens
+- moved token pattern logic to config, using a new `TokenPatternAnnotator`
+- moved context pattern logic to config, using a new `ContextAnnotator`
+- many updates to name detection logic
+  - lookup list optimizations
+  - added, removed and simplified patterns
+
+## 2.1.0 (2023-08-07)
+
+### Added
+- a component for deidentifying BSN-nummers
+
+### Changed
+- updated dependencies
+- by default, deduce now recognizes and tags bsn nummers
+- by default, deduce now recognizes all other 7+ digit numbers as identifiers
+- improved regular expressions for e-mail address and url matching, with separate tags
+- logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)
+- improved regular expression for age matching
+- date detection logic:
+  - now only recognizes combinations of day, month and year (day/month combinations caused many false positives)
+  - detects year-month-day format in addition to (day-month-year)
+- loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
+
+### Deprecated
+- backwards compatibility, which was temporary added to transition from v1 to v2
+
+### Removed
+- a separate patient identifier tag, now superseded by a generic tag
+- detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores) 
+
+### Fixed
+- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
+
+## 2.0.3 (2023-04-06)
+
+### Fixed
+- removed 'decibutus' from list of institutions as it caused many false positives
+
+## 2.0.2 (2023-03-28)
+
+### Changed
+- upgraded dependencies, including `markdown-it-py` which had a vulnerability
+
+## 2.0.1 (2022-12-09)
+
+### Changed
+- upgraded dependencies
+
+## 2.0.0 (2022-12-05)
+
+### Added
+- introduced new interface for deidentification, using `Deduce()` class
+- a separate documentation page, with tutorial and migration guide
+- support for python 3.10 and 3.11
+
+### Changed
+- major refactor that touches pretty much every line of code
+- use `docdeid` package for logic
+- speedups: now 973% faster
+- use lookup sets instead of lookup lists
+- refactor tokenizer
+- refactor annotators into separate classes, using structured annotations
+- guidelines for contributing
+
+### Removed
+- the `annotate_text` and `deidentify_annotations` functions
+- all in-text annotation (under the hood) and associated functions
+- support for given names. given names can be added as another first name in the `Person` class. 
+- support for python 3.7 and 3.8
+
+### Fixed
+- `<` and `>` are no longer replaced by `(` and `)` respectively
+- deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore
+
+## 1.0.8 (2021-11-29)
+
+### Added
+- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation
+
+### Fixed
+- various modifications related to adding or subtracting spaces in annotated texts
+- remove the lowercasing of institutions' names
+- therefore, all structured annotations have texts matching the original text in the same span
+
+## 1.0.7 (2021-11-03)
+
+### Changed
+- Internal code formatting improvements
+
+### Added
+- Contributing guidelines
+
+## 1.0.6 (2021-10-06)
+
+### Fixed
+- Bug with multiple 4-digit mg dosages in one text
+
+## 1.0.5 (2021-10-05)
+
+### Fixed
+- Minor bug where tag flattening had no effect
+
+## 1.0.4 (2021-10-05)
+
+### Added
+- Changelog
+- Additional unit tests for whitespace/punctuation
+
+### Fixed
+- Various whitespace/punctuation bugs
+- Bug with nested tags not related to person names
+- Bug with adjacent tags not being merged
+
+## 1.0.3 (2021-07-07)
+
+### Added
+- Structured annotations
+- Some unit tests
+
+### Fixed
+- Error with outdated unicode package
+- Bug with context
+
+## 1.0.2 
+Release to PyPI
+
+## 1.0.1 
+Small bugfix for None as input
+
+## 1.0.0 
+Initial version