# deduce

> Deduce 3.0.0 is out! It is considerably more accurate, and faster too. It is fully backward compatible, but some functionality is scheduled for removal; read more in the migration guide: [docs/migrating-to-v3](https://deduce.readthedocs.io/en/latest/migrating.html)

<!-- start include in docs -->

* :sparkles: Remove sensitive information from clinical text written in Dutch
* :mag: Rule-based logic for detecting e.g. names, locations, institutions, identifiers, phone numbers
* :triangular_ruler: Useful out of the box, but customization highly recommended
* :seedling: Originally validated in [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365), but further optimized since

> :exclamation: Deduce is useful out of the box, but please validate and customize on your own data before using it in a critical environment. Remember that de-identification is almost never perfect, and that clinical text often contains other specific details that can link it to a specific person. Be aware that de-identification should primarily be viewed as a way to mitigate risk of identification, rather than a way to obtain anonymous data.

Currently, `deduce` can remove the following types of Protected Health Information (PHI):

* :bust_in_silhouette: person names, including prefixes and initials
* :earth_americas: geographical locations smaller than a country
* :hospital: names of hospitals and healthcare institutions
* :calendar: dates (combinations of day, month and year)
* :birthday: ages
* :1234: BSN numbers
* :1234: identifiers (7+ digits without a specific format, e.g. patient identifiers, AGB, BIG)
* :phone: phone numbers
* :e-mail: e-mail addresses
* :link: URLs

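The BSN item in the list above refers to the Dutch citizen service number (burgerservicenummer). As a small illustration of why such numbers suit rule-based detection, the sketch below implements the standard BSN checksum, known as the eleven-test ("elfproef"), for nine-digit numbers. The function name `is_valid_bsn` is hypothetical and the code is not taken from `deduce`; whether and how `deduce` applies this check internally is an assumption.

```python
def is_valid_bsn(candidate: str) -> bool:
    """Return True if `candidate` is nine ASCII digits that pass the eleven-test."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    # Weights 9 down to 2 for the first eight digits, and -1 for the last digit.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(w * int(d) for w, d in zip(weights, candidate))
    # A valid number has a weighted sum divisible by 11.
    return total % 11 == 0


print(is_valid_bsn("111222333"))  # True: the weighted sum is 66
print(is_valid_bsn("111222334"))  # False: the weighted sum is 65
```

A checksum like this helps separate BSN numbers from arbitrary nine-digit identifiers, which is one reason the two appear as separate items in the list.
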
## Citing

If you use `deduce`, please cite the following paper:

[Menger, V.J., Scheepers, F., van Wijk, L.M., Spruit, M. (2017). DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text, Telematics and Informatics, 2017, ISSN 0736-5853](http://www.sciencedirect.com/science/article/pii/S0736585316307365)

<!-- end include in docs -->

<!-- start getting started -->

## Installation

```bash
pip install deduce
```

## Getting started

The basic way to use `deduce` is to pass text to the `deidentify` method of a `Deduce` object:

```python
from deduce import Deduce

deduce = Deduce()

text = (
    "betreft: Jan Jansen, bsn 111222333, patnr 000334433. De patient J. Jansen is 64 jaar oud en woonachtig in "
    "Utrecht. Hij werd op 10 oktober 2018 door arts Peter de Visser ontslagen van de kliniek van het UMCU. "
    "Voor nazorg kan hij worden bereikt via j.JNSEN.123@gmail.com of (06)12345678."
)

doc = deduce.deidentify(text)
```

The output is available in the `Document` object:

```python
from pprint import pprint

pprint(doc.annotations)

# AnnotationSet({
#     Annotation(text="(06)12345678", start_char=272, end_char=284, tag="telefoonnummer"),
#     Annotation(text="111222333", start_char=25, end_char=34, tag="bsn"),
#     Annotation(text="Peter de Visser", start_char=153, end_char=168, tag="persoon"),
#     Annotation(text="j.JNSEN.123@gmail.com", start_char=247, end_char=268, tag="email"),
#     Annotation(text="patient J. Jansen", start_char=56, end_char=73, tag="patient"),
#     Annotation(text="Jan Jansen", start_char=9, end_char=19, tag="patient"),
#     Annotation(text="10 oktober 2018", start_char=127, end_char=142, tag="datum"),
#     Annotation(text="64", start_char=77, end_char=79, tag="leeftijd"),
#     Annotation(text="000334433", start_char=42, end_char=51, tag="id"),
#     Annotation(text="Utrecht", start_char=106, end_char=113, tag="locatie"),
#     Annotation(text="UMCU", start_char=202, end_char=206, tag="instelling"),
# })

print(doc.deidentified_text)

# betreft: [PERSOON-1], bsn [BSN-1], patnr [ID-1]. De [PERSOON-1] is [LEEFTIJD-1] jaar oud en woonachtig in
# [LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1].
# Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1].
```

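Each annotation carries `start_char` and `end_char` offsets into the original text, so the annotated span can be recovered by slicing. The sketch below illustrates this with a stand-in `Span` tuple and a hypothetical `replace_spans` helper. Neither is part of `deduce`, whose own replacement logic (for instance the numbering in `[PERSOON-1]`) is more involved; this only shows how half-open offsets relate to the text.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Stand-in for an annotation: half-open character offsets plus a tag."""
    start_char: int
    end_char: int
    tag: str


def replace_spans(text: str, spans: List[Span]) -> str:
    """Replace each span with an upper-cased [TAG] placeholder."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for span in sorted(spans, key=lambda s: s.start_char, reverse=True):
        text = text[:span.start_char] + f"[{span.tag.upper()}]" + text[span.end_char:]
    return text


text = "betreft: Jan Jansen, bsn 111222333"
spans = [Span(9, 19, "persoon"), Span(25, 34, "bsn")]

# Slicing with the offsets recovers the annotated text.
print(text[9:19])   # Jan Jansen
print(text[25:34])  # 111222333
print(replace_spans(text, spans))  # betreft: [PERSOON], bsn [BSN]
```
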
Additionally, if the names of the patient are known, they may be added as `metadata`, where they will be picked up by `deduce`:

```python
from deduce.person import Person

patient = Person(first_names=["Jan"], initials="JJ", surname="Jansen")
doc = deduce.deidentify(text, metadata={"patient": patient})

print(doc.deidentified_text)

# betreft: [PATIENT], bsn [BSN-1], patnr [ID-1]. De [PATIENT] is [LEEFTIJD-1] jaar oud en woonachtig in
# [LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1].
# Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1].
```

As you can see, adding known names keeps references to the patient recognizable as `[PATIENT]` in the text. It also increases recall, as not all names are contained in the lookup lists.

<!-- end getting started -->

## Versions

For most cases the latest version is suitable. Some specific milestones are:

* `3.0.0` - Many optimizations in accuracy, smaller refactors, further speedups
* `2.0.0` - Major refactor, with speedups, many new options for customizing, functionally very similar to original
* `1.0.8` - Small bugfixes compared to original release
* `1.0.1` - Original release with [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365)

Detailed versioning information is available in the [changelog](CHANGELOG.md).

## Documentation

All documentation, including the API reference and a more extensive tutorial on using, configuring and modifying `deduce`, is available at: [docs/tutorial](https://deduce.readthedocs.io/en/latest/)

## Contributing

For setting up the dev environment and contributing guidelines, see: [docs/contributing](https://deduce.readthedocs.io/en/latest/contributing.html)

## Authors

* **Vincent Menger** - *Initial work*
* **Jonathan de Bruin** - *Code review*
* **Pablo Mosteiro** - *Bug fixes, structured annotations*

## License

This project is licensed under the GNU General Public License v3.0. See the [LICENSE.md](LICENSE.md) file for details.