# deidentify

A Python library to de-identify medical records with state-of-the-art NLP methods. Pre-trained models for the Dutch language are available.

This repository shares the resources developed in the following paper:

> J. Trienes, D. Trieschnigg, C. Seifert, and D. Hiemstra. Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records. In: *Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop (HSDM)*, 2020.

Read more about the work in our [paper](https://arxiv.org/abs/2001.05714) or [blog post](https://medium.com/nedap/de-identification-of-ehr-using-nlp-a270d40fc442).

## Quick Start

### Installation

Create a new virtual environment with an environment manager of your choice. Then, install `deidentify`:

```sh
pip install deidentify
```

We use the spaCy tokenizer. For best compatibility with the pre-trained models, we recommend using the same spaCy version that we used to train the de-identification models.

```sh
pip install -U "spacy<3" https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.3.0/nl_core_news_sm-2.3.0.tar.gz#egg=nl_core_news_sm==2.3.0
```

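To confirm that the installed spaCy version falls under the `"spacy<3"` pin used in the install command, a quick check along these lines may help. This is only a sketch: the helper names `satisfies_spacy_pin` and `installed_spacy_version` are made up for illustration and are not part of `deidentify`.

```py
from importlib import metadata


def satisfies_spacy_pin(version: str) -> bool:
    """Return True if a version string such as '2.3.5' has a major version below 3."""
    major = version.split(".")[0]
    return major.isdigit() and int(major) < 3


def installed_spacy_version():
    """Return the installed spaCy version string, or None if spaCy is not installed."""
    try:
        return metadata.version("spacy")
    except metadata.PackageNotFoundError:
        return None


version = installed_spacy_version()
if version is None:
    print("spaCy is not installed")
elif satisfies_spacy_pin(version):
    print(f"spaCy {version} satisfies the 'spacy<3' pin")
else:
    print(f"spaCy {version} is outside the 'spacy<3' pin")
```
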
### Example Usage

The code below shows how to apply a pre-trained de-identification pipeline to an example document. We provide a [list of available models](#pre-trained-models) below.

```py
from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

# Create some text
text = (
    "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: "
    "j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 "
    "oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
)

# Wrap text in document
documents = [
    Document(name='doc_01', text=text)
]

# Select downloaded model
model = 'model_bilstmcrf_ons_fast-v0.2.0'

# Instantiate tokenizer
tokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=("tagger", "ner"))

# Load tagger with a downloaded model file and tokenizer
tagger = FlairTagger(model=model, tokenizer=tokenizer, verbose=False)

# Annotate your documents
annotated_docs = tagger.annotate(documents)
```

This completes the annotation stage. Let's inspect the entities that the tagger found:

```py
from pprint import pprint

first_doc = annotated_docs[0]
pprint(first_doc.annotations)
```

This should print the entities of the first document.

```py
[Annotation(text='Jan Jansen', start=39, end=49, tag='Name', doc_id='', ann_id='T0'),
 Annotation(text='J. Jansen', start=62, end=71, tag='Name', doc_id='', ann_id='T1'),
 Annotation(text='j.jnsen@email.com', start=76, end=93, tag='Email', doc_id='', ann_id='T2'),
 Annotation(text='06-12345678', start=98, end=109, tag='Phone_fax', doc_id='', ann_id='T3'),
 Annotation(text='64 jaar', start=114, end=121, tag='Age', doc_id='', ann_id='T4'),
 Annotation(text='Utrecht', start=143, end=150, tag='Address', doc_id='', ann_id='T5'),
 Annotation(text='10 oktober', start=164, end=174, tag='Date', doc_id='', ann_id='T6'),
 Annotation(text='Peter de Visser', start=185, end=200, tag='Name', doc_id='', ann_id='T7'),
 Annotation(text='UMCU', start=234, end=238, tag='Hospital', doc_id='', ann_id='T8')]
```

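Each annotation carries character offsets into the original text, so `text[start:end]` should reproduce the annotated string. The sketch below illustrates this and groups entities by tag. To stay runnable without a downloaded model, it uses a local `Annotation` namedtuple as a stand-in with the fields shown in the printed output; with the real library you would iterate over `first_doc.annotations` instead.

```py
from collections import defaultdict, namedtuple

# Stand-in with the fields shown in the printed output; not the library class.
Annotation = namedtuple("Annotation", ["text", "start", "end", "tag", "doc_id", "ann_id"])

text = (
    "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: "
    "j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 "
    "oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
)

# A subset of the annotations from the example output.
annotations = [
    Annotation("Jan Jansen", 39, 49, "Name", "", "T0"),
    Annotation("J. Jansen", 62, 71, "Name", "", "T1"),
    Annotation("j.jnsen@email.com", 76, 93, "Email", "", "T2"),
    Annotation("UMCU", 234, 238, "Hospital", "", "T8"),
]

entities_by_tag = defaultdict(list)
for ann in annotations:
    # Offsets are character positions: start is inclusive, end is exclusive.
    assert text[ann.start:ann.end] == ann.text
    entities_by_tag[ann.tag].append(ann.text)

print(dict(entities_by_tag))
```
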
#### Mask Annotations

Use masking to replace annotations with placeholders. Example: `Jan Jansen -> [NAME]`

```py
from deidentify.util import mask_annotations

masked_doc = mask_annotations(first_doc)
print(masked_doc.text)
```

This should print:

> Dit is stukje tekst met daarin de naam [NAME]. De patient [NAME] (e: [EMAIL], t: [PHONE_FAX]) is [AGE] oud en woonachtig in [ADDRESS]. Hij werd op [DATE] door arts [NAME] ontslagen van de kliniek van het [HOSPITAL].

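Conceptually, masking is a span replacement over character offsets. The following sketch is not the implementation of `mask_annotations`; it is a minimal, self-contained illustration with a stand-in `Annotation` namedtuple and a hypothetical helper `mask_text`. It replaces spans from right to left so that the offsets of earlier spans remain valid.

```py
from collections import namedtuple

# Stand-in with the fields shown in the example output; not the library class.
Annotation = namedtuple("Annotation", ["text", "start", "end", "tag", "doc_id", "ann_id"])


def mask_text(text, annotations):
    """Replace each annotated span with an upper-cased [TAG] placeholder."""
    masked = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        masked = masked[:ann.start] + f"[{ann.tag.upper()}]" + masked[ann.end:]
    return masked


text = "De patient J. Jansen is 64 jaar oud."
annotations = [
    Annotation("J. Jansen", 11, 20, "Name", "", "T0"),
    Annotation("64 jaar", 24, 31, "Age", "", "T1"),
]

print(mask_text(text, annotations))
```
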
#### Replace Annotations with Surrogates [experimental]

Use surrogate generation to replace annotations with random but realistic alternatives. Example: `Jan Jansen -> Bart Bakker`. The surrogate replacement strategy follows [Stubbs et al. (2015)](https://doi.org/10.1007/978-3-319-23633-9_27).

```py
from deidentify.util import surrogate_annotations

# The surrogate generation process involves some randomness.
# You can set a seed to make the process deterministic.
iter_docs = surrogate_annotations(docs=[first_doc], seed=1)
surrogate_doc = list(iter_docs)[0]
print(surrogate_doc.text)
```

This code should print:

> Dit is stukje tekst met daarin de naam Gijs Hermelink. De patient G. Hermelink (e: n.qvgjj@spqms.com, t: 06-83662585) is 64 jaar oud en woonachtig in Cothen. Hij werd op 28 juni door arts Jullian van Troost ontslagen van de kliniek van het UMCU.

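The effect of a seed can be illustrated with a small, self-contained sketch. This is not how `surrogate_annotations` is implemented; the name pool and the helper `pick_surrogates` are invented for illustration. It only shows that a seeded random generator yields the same replacements on every run, and that reusing one mapping keeps repeated mentions consistent.

```py
import random

# Hypothetical pool of replacement names, for illustration only.
NAME_POOL = ["Bart Bakker", "Sanne Smit", "Kees de Boer", "Lotte Mulder"]


def pick_surrogates(originals, seed=None):
    """Map each distinct original string to a randomly chosen replacement."""
    rng = random.Random(seed)  # a fixed seed gives a reproducible sequence
    mapping = {}
    for original in originals:
        if original not in mapping:  # repeated mentions reuse the same surrogate
            mapping[original] = rng.choice(NAME_POOL)
    return mapping


names = ["Jan Jansen", "Peter de Visser", "Jan Jansen"]
first_run = pick_surrogates(names, seed=1)
second_run = pick_surrogates(names, seed=1)
print(first_run == second_run)  # True: same seed, same replacements
```
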
### Available Taggers

There are currently three taggers that you can use:

   * `DeduceTagger`: A wrapper around the DEDUCE tagger by Menger et al. (2018, [code](https://github.com/vmenger/deduce), [paper](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365))
   * `CRFTagger`: A CRF tagger using the feature set by Liu et al. (2015, [paper](https://www.sciencedirect.com/science/article/pii/S1532046415001197))
   * `FlairTagger`: A wrapper around the Flair [`SequenceTagger`](https://github.com/zalandoresearch/flair/blob/2d6e89bdfe05644b4e5c7e8327f6ecc6b834ec9e/flair/models/sequence_tagger_model.py#L68) allowing the use of neural architectures such as BiLSTM-CRF. The pre-trained models below use contextualized string embeddings by Akbik et al. (2018, [paper](https://www.aclweb.org/anthology/C18-1139/))

All taggers implement the `deidentify.taggers.TextTagger` interface, which you can also implement to provide your own tagger.

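As a rough illustration of a custom tagger, the sketch below tags e-mail addresses with a regular expression. It is self-contained and uses local stand-ins for `Document` and `Annotation` (fields taken from the example output), so it runs without the library. The assumption here is that a tagger needs an `annotate(documents)` method and a `tags` property, as used elsewhere in this README; check `deidentify.taggers.TextTagger` for the exact abstract methods before subclassing it.

```py
import re
from collections import namedtuple
from dataclasses import dataclass, field
from typing import List

# Local stand-ins so the sketch runs on its own; not the library classes.
Annotation = namedtuple("Annotation", ["text", "start", "end", "tag", "doc_id", "ann_id"])


@dataclass
class Document:
    name: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)


class RegexEmailTagger:
    """Hypothetical tagger that marks e-mail addresses with the tag 'Email'."""

    # Deliberately simple pattern; real e-mail matching is more involved.
    _pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    @property
    def tags(self) -> List[str]:
        return ["Email"]

    def annotate(self, documents: List[Document]) -> List[Document]:
        annotated = []
        for doc in documents:
            anns = [
                Annotation(m.group(), m.start(), m.end(), "Email", "", f"T{i}")
                for i, m in enumerate(self._pattern.finditer(doc.text))
            ]
            annotated.append(Document(name=doc.name, text=doc.text, annotations=anns))
        return annotated


docs = [Document(name="doc_01", text="De patient J. Jansen (e: j.jnsen@email.com) is 64 jaar oud.")]
result = RegexEmailTagger().annotate(docs)
print(result[0].annotations)
```
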
### Tag Set

Use the `TextTagger.tags` property to get a list of supported tags. For the `FlairTagger` in the demo above, this looks as follows:

```py
>>> tagger.tags
['Internal_Location', 'Age', 'Phone_fax', 'Name', 'SSN', 'Hospital', 'Email', 'Initials', 'O',
'Organization_Company', 'ID', 'Profession', 'Care_Institute', 'Other', 'Date', 'URL_IP', 'Address']
```

### Pre-trained Models

We provide a number of pre-trained models for the Dutch language. The models were developed on the Nedap/University of Twente (NUT) dataset. The dataset consists of 1260 documents from three domains of Dutch healthcare: elderly care, mental care and disabled care (note: in the codebase we sometimes also refer to this dataset as `ons`). More information on the design of the dataset can be found in [our paper](https://arxiv.org/abs/2001.05714).

| Name | Tagger | Lang | Dataset | F1* | Precision* | Recall* | Tags |
|------|--------|----------|---------|----|-----------|--------|--------|
| [DEDUCE (Menger et al., 2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365)** | `DeduceTagger` | NL | NUT | 0.6649 | 0.8192 | 0.5595 | [8 PHI Tags](https://github.com/nedap/deidentify/blob/168ad67aec586263250900faaf5a756d3b8dd6fa/deidentify/methods/deduce/run_deduce.py#L17) |
| [model_crf_ons_tuned-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) | `CRFTagger` | NL | NUT | 0.8511 | 0.9337 | 0.7820 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) |
| [model_bilstmcrf_ons_fast-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) | `FlairTagger`  | NL | NUT | 0.8914 | 0.9101 | 0.8735 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) |
| [model_bilstmcrf_ons_large-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) | `FlairTagger` | NL | NUT | 0.8990 | 0.9240 | 0.8754 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) |

*\*All scores are micro-averaged entity-level precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.*

*\*\*DEDUCE was developed on a dataset of psychiatric nursing notes and treatment plans. The numbers reported here were obtained by applying DEDUCE to our NUT dataset. For more information on the development of DEDUCE, see the paper by [Menger et al. (2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365).*

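For readers unfamiliar with micro-averaging: counts of true positives, false positives and false negatives are pooled over all tags before precision, recall and F1 are computed. The sketch below shows the arithmetic with made-up counts (they are not taken from the experiments) and then checks that F1 is the harmonic mean of the precision and recall listed for `model_crf_ons_tuned-v0.2.0`. The function name `micro_prf` is hypothetical and not part of the library.

```py
def micro_prf(counts):
    """Compute micro-averaged precision, recall and F1.

    `counts` maps a tag to a (true_pos, false_pos, false_neg) tuple.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up counts for two tags, for illustration only.
example_counts = {"Name": (90, 10, 20), "Date": (45, 5, 5)}
precision, recall, f1 = micro_prf(example_counts)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")

# F1 as the harmonic mean of the CRF model's precision and recall from the table.
p, r = 0.9337, 0.7820
crf_f1 = 2 * p * r / (p + r)
print(round(crf_f1, 4))
```

Because the table values are rounded to four decimals, recomputing F1 from the listed precision and recall can differ from the listed F1 in the last digit.
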
## Running Experiments and Training Models

If you have your own dataset of annotated documents and you want to train your own models on it, you can take a look at the following guides:

   * [Convert your data into our corpus format](docs/01_data_format.md)
   * [Train and evaluate your own models](docs/02_train_evaluate_models.md)
   * [Logging and pipeline verbosity](docs/06_pipeline_verbosity.md)

If you want more information on the experiments in our paper, have a look here:

   * [NUT annotation guidelines](docs/03_hsdm2020_nut_annotation_guidelines.md)
   * [Surrogate generation procedure](docs/04_hsdm2020_surrogate_generation.md)
   * [Experiments on English corpora: i2b2/UTHealth and nursing notes](docs/05_hsdm2020_english_datasets.md)

### Computational Environment

To run your own experiments, we assume that you have cloned this code base locally and that you execute all scripts under `deidentify/` within the following conda environment:

```sh
# Install package dependencies and add local files to the Python path of that environment.
conda env create -f environment.yml
conda activate deidentify && export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```

## Citation

Please cite the following paper when using `deidentify`:

```bibtex
@inproceedings{Trienes:2020:CRF,
  title = {Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records},
  author = {Trienes, Jan and Trieschnigg, Dolf and Seifert, Christin and Hiemstra, Djoerd},
  booktitle = {Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop},
  series = {{HSDM} 2020},
  year = {2020}
}
```

## Contact

If you have any questions, please contact Jan Trienes at jan.trienes@gmail.com.